ECCV 2026

PolicyTrim

Boosting Intrinsic Policy Efficiency of Vision-Language-Action Models

¹Sichuan University  ²The University of Adelaide  ³Beijing Institute of Technology
*Co-first authors Project lead Corresponding author

Intrinsic Policy Inefficiency in Deployed VLA Models

Intrinsic policy inefficiency analysis
Figure 1. Repeated rollouts reveal large step-count variance, while forced long chunk execution can reduce success and increase physical steps, exposing unreliable tail predictions and redundant corrective actions.
Abstract

Two-Stage RL Post-Training for Policy Efficiency

Vision-Language-Action (VLA) models provide a unified paradigm for robotic manipulation, yet their real-world deployment is often bottlenecked by execution efficiency. While existing efforts predominantly focus on compute-centric efficiency to reduce per-step inference latency, the intrinsic policy efficiency of these models remains largely unexplored. Policy efficiency is fundamentally affected by two factors: the effective executable length of predicted action chunks and the total physical steps required to complete a task.
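To make the two factors concrete, here is a minimal sketch of chunked execution. It is an illustration, not the authors' implementation. It assumes a toy episode that finishes after a fixed number of physical steps, and the names run_chunked_episode, predict_chunk, and exec_len are hypothetical, introduced only for this example.

```python
from typing import Callable, Dict, List


def run_chunked_episode(
    predict_chunk: Callable[[int], List[float]],
    steps_needed: int,
    exec_len: int,
) -> Dict[str, int]:
    """Run a toy episode with chunked action execution.

    predict_chunk(t) returns the action chunk predicted at physical step t.
    Only the first `exec_len` actions of each chunk are executed before the
    policy is queried again. ASSUMPTION: the task simply ends after
    `steps_needed` physical steps, which a real environment would decide.
    """
    t = 0       # physical steps executed so far
    calls = 0   # number of policy inference calls
    while t < steps_needed:
        chunk = predict_chunk(t)
        calls += 1
        n = min(exec_len, len(chunk), steps_needed - t)
        if n <= 0:
            raise ValueError("policy returned an empty chunk or exec_len < 1")
        t += n
    return {"physical_steps": t, "inference_calls": calls}


def dummy_policy(t: int) -> List[float]:
    """Hypothetical stand-in for a VLA policy: always predicts 16 actions."""
    return [0.0] * 16


short = run_chunked_episode(dummy_policy, steps_needed=100, exec_len=5)
long = run_chunked_episode(dummy_policy, steps_needed=100, exec_len=15)
print(short, long)
```

In this toy setting, executing more of each chunk lowers the number of inference calls for the same number of physical steps, and finishing in fewer physical steps lowers it further. These are the two levers the abstract describes.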

We observe that current VLA policies suffer from planning unreliability and action redundancy: predictions degrade severely toward the tail of each action chunk, and the policies tend to generate redundant physical steps. To address this, we propose PolicyTrim, a reinforcement learning-based post-training framework that extends the reliable action chunk length and reduces redundant physical steps via dynamic horizon exploration and a redundancy-aware step-saving reward.

Extensive experiments across three benchmarks and three VLA models demonstrate that PolicyTrim improves action chunk utilization while reducing physical execution steps by 51.4%. Ultimately, our framework delivers up to a 5.83× end-to-end deployment speedup without compromising task success rates.

Action chunk utilization improvement

51.4%

Physical execution step reduction

5.83×

End-to-end deployment speedup

Method

Two-Stage RL Post-Training Framework

PolicyTrim method overview
Figure 2. PolicyTrim is a two-stage RL post-training framework: first extend the reliable action chunk horizon, then reduce redundant physical steps with a step-saving objective.

Stage 1

Reliable Action Chunk Extension

A progressive reliability sweep rewards successful rollouts that sustain longer executable action chunks, pushing the trustworthy prediction frontier toward the usable chunk limit.
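One way to picture this stage is the sketch below. The page does not give the actual reward or schedule, so everything here is an assumption: the linear horizon bonus, the fixed sweep increment, and the names sweep_horizons and horizon_reward are hypothetical and only illustrate the idea of rewarding successful rollouts more when they execute longer chunks.

```python
from typing import List


def sweep_horizons(start: int, max_len: int, step: int) -> List[int]:
    """Progressively longer execution horizons, ending at the chunk limit.

    ASSUMPTION: a fixed increment; the real sweep schedule is not specified.
    """
    if not (1 <= start <= max_len) or step < 1:
        raise ValueError("need 1 <= start <= max_len and step >= 1")
    horizons = list(range(start, max_len, step))
    horizons.append(max_len)  # always finish at the usable chunk limit
    return horizons


def horizon_reward(success: bool, exec_len: int, max_len: int) -> float:
    """Toy reward: zero on failure, otherwise proportional to executed length.

    ASSUMPTION: a linear bonus in exec_len / max_len, chosen for illustration.
    """
    if not 1 <= exec_len <= max_len:
        raise ValueError("exec_len must lie in [1, max_len]")
    return exec_len / max_len if success else 0.0


for h in sweep_horizons(start=5, max_len=15, step=5):
    print(h, horizon_reward(True, h, 15), horizon_reward(False, h, 15))
```

Under these assumptions, a failed rollout earns nothing at any horizon, so longer execution is rewarded only when the tail of the chunk is reliable enough to still complete the task.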

Stage 2

Redundancy-Aware Step Reduction

A step-saving reward favors successful task completions with fewer physical steps, while stability regularization discourages unreproducible shortcuts.
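The sketch below shows one possible shape for such an objective. It is not the reward used by PolicyTrim, whose exact form is not given here. The reference step count ref_steps, the weights alpha and beta, the use of a standard-deviation penalty as the stability term, and both function names are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def step_saving_reward(
    success: bool, steps: int, ref_steps: int, alpha: float = 1.0
) -> float:
    """Toy reward: 0 on failure, else 1 plus a bonus for steps saved.

    ASSUMPTION: savings are measured against a reference step count
    (for example, the baseline policy's average on the same task).
    """
    if not success:
        return 0.0
    saving = max(0.0, (ref_steps - steps) / ref_steps)
    return 1.0 + alpha * saving


def stabilized_reward(
    rollouts: List[Tuple[bool, int]],
    ref_steps: int,
    alpha: float = 1.0,
    beta: float = 0.5,
) -> float:
    """Mean reward over repeated rollouts minus a spread penalty.

    ASSUMPTION: stability is approximated by penalizing the standard
    deviation of rewards, so shortcuts that only sometimes work score lower.
    """
    rewards = [step_saving_reward(ok, n, ref_steps, alpha) for ok, n in rollouts]
    return mean(rewards) - beta * pstdev(rewards)


reliable = stabilized_reward([(True, 60), (True, 60)], ref_steps=120)
flaky = stabilized_reward([(True, 30), (False, 30)], ref_steps=120)
print(reliable, flaky)
```

With these toy numbers, a moderately shorter trajectory that succeeds every time scores higher than an aggressive shortcut that succeeds only half the time, which is the behavior the stability term is meant to encourage.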

Experiments

Benchmark Results

LIBERO Benchmark

Average success rate (SR), total physical steps (Stotal), average action chunk execution length (hchunk), and end-to-end speedup (Spd↑) across four LIBERO subsets.

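Each model cell lists SR / Stotal / hchunk / Spd↑.

Task    | Method     | π₀.₅                      | OpenVLA-OFT              | GR00T
Spatial | Baseline   | 97.8 / 108.3 / 5 / 1.0    | 98.6 / 111.2 / 8 / 1.0   | 91.4 / 67.2 / 5 / 1.0
Spatial | PolicyTrim | 97.8 / 59.8 / 15 / 5.43×  | 98.8 / 62.1 / 8 / 1.79×  | 92.0 / 56.6 / 10 / 2.37×
Object  | Baseline   | 99.1 / 125.0 / 5 / 1.0    | 98.5 / 135.2 / 8 / 1.0   | 95.0 / 71.3 / 5 / 1.0
Object  | PolicyTrim | 98.5 / 64.3 / 15 / 5.83×  | 98.5 / 68.8 / 8 / 1.97×  | 95.3 / 65.5 / 10 / 2.18×
Goal    | Baseline   | 98.7 / 110.6 / 5 / 1.0    | 97.7 / 118.6 / 8 / 1.0   | 84.2 / 63.3 / 5 / 1.0
Goal    | PolicyTrim | 98.8 / 63.5 / 15 / 5.23×  | 98.0 / 66.9 / 8 / 1.77×  | 86.3 / 60.8 / 10 / 2.08×
Long    | Baseline   | 93.0 / 249.8 / 5 / 1.0    | 92.9 / 249.3 / 8 / 1.0   | 86.1 / 177.9 / 5 / 1.0
Long    | PolicyTrim | 93.3 / 171.8 / 10 / 2.91× | 93.1 / 178.3 / 8 / 1.40× | 89.2 / 165.9 / 10 / 2.14×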
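The speedups in the LIBERO table appear consistent with a simple ratio of policy inference calls, where calls are approximated as Stotal divided by hchunk. This is an observation from the reported numbers, not a definition stated on this page, and the function names below are hypothetical. The sketch checks it against the π₀.₅ columns.

```python
from typing import Dict, Tuple

# Values copied from the π₀.₅ columns of the LIBERO table:
# (baseline Stotal, baseline hchunk, PolicyTrim Stotal, PolicyTrim hchunk,
#  reported speedup)
PI05_ROWS: Dict[str, Tuple[float, int, float, int, float]] = {
    "Spatial": (108.3, 5, 59.8, 15, 5.43),
    "Object": (125.0, 5, 64.3, 15, 5.83),
    "Goal": (110.6, 5, 63.5, 15, 5.23),
    "Long": (249.8, 5, 171.8, 10, 2.91),
}


def inference_calls(total_steps: float, chunk_len: int) -> float:
    """Approximate number of policy queries per episode."""
    return total_steps / chunk_len


def call_ratio(
    base_steps: float, base_chunk: int, new_steps: float, new_chunk: int
) -> float:
    """ASSUMED speedup model: baseline calls divided by PolicyTrim calls."""
    return inference_calls(base_steps, base_chunk) / inference_calls(
        new_steps, new_chunk
    )


ESTIMATES = {
    task: call_ratio(bs, bh, ns, nh)
    for task, (bs, bh, ns, nh, _) in PI05_ROWS.items()
}

for task, est in ESTIMATES.items():
    reported = PI05_ROWS[task][4]
    print(f"{task:8s} estimated {est:.2f}x  reported {reported:.2f}x")
```

Under this assumed model, the estimates agree with the reported π₀.₅ speedups to two decimal places. That suggests the end-to-end gain comes jointly from fewer physical steps and longer executed chunks. For OpenVLA-OFT, where hchunk stays at 8, the same ratio reduces to the step-count ratio alone.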

ManiSkill & Meta-World

PolicyTrim improves both success rates and step efficiency across benchmarks, reaching up to 2.52× speedup on Meta-World and 2.36× on ManiSkill.

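Each model cell lists SR / Stotal / hchunk / Spd↑.

Benchmark  | Method     | π₀.₅                     | OpenVLA-OFT
ManiSkill  | Baseline   | 88.1 / 45.2 / 5 / 1.0    | 60.6 / 53.1 / 8 / 1.0
ManiSkill  | PolicyTrim | 89.8 / 38.3 / 10 / 2.36× | 63.2 / 46.7 / 8 / 1.14×
Meta-World | Baseline   | 65.1 / 66.3 / 5 / 1.0    | not evaluated
Meta-World | PolicyTrim | 65.4 / 52.6 / 10 / 2.52× | not evaluated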

Real-World Deployment

PolicyTrim transfers its efficiency gains to physical deployment on an Agilex Piper arm, maintaining or improving success rates while achieving 1.86× average wall-clock speedup under the standard real-world setting.

Method     | Std. SR (Flip / Hang / Tape) | Dyn. SR (Flip / Tape) | Time (Flip / Hang / Tape)
Baseline   | 70 / 60 / 95                 | 70 / 65               | 14.6 / 15.6 / 17.5
PolicyTrim | 75 / 65 / 95                 | 70 / 70               | 7.6 / 8.7 / 9.4
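The 1.86× figure can be reproduced from the Time columns of the real-world table. The page does not say how the average is aggregated, so the sketch below tries two plausible choices: the mean of per-task ratios and the ratio of total times. Variable and function names are hypothetical.

```python
from statistics import mean
from typing import Dict

# Time values copied from the real-world table (Flip, Hang, Tape).
BASELINE_TIME: Dict[str, float] = {"Flip": 14.6, "Hang": 15.6, "Tape": 17.5}
POLICYTRIM_TIME: Dict[str, float] = {"Flip": 7.6, "Hang": 8.7, "Tape": 9.4}


def per_task_speedup(
    base: Dict[str, float], new: Dict[str, float]
) -> Dict[str, float]:
    """Speedup per task as baseline time divided by PolicyTrim time."""
    return {task: base[task] / new[task] for task in base}


# ASSUMPTION 1: average the per-task ratios.
MEAN_OF_RATIOS = mean(per_task_speedup(BASELINE_TIME, POLICYTRIM_TIME).values())

# ASSUMPTION 2: divide total baseline time by total PolicyTrim time.
RATIO_OF_TOTALS = sum(BASELINE_TIME.values()) / sum(POLICYTRIM_TIME.values())

print(f"mean of per-task ratios: {MEAN_OF_RATIOS:.2f}x")
print(f"ratio of total times:    {RATIO_OF_TOTALS:.2f}x")
```

Both aggregations round to 1.86×, so the reported average is consistent with the table under either reading.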

Architectural Generality

PolicyTrim also generalizes beyond the standard OpenVLA-OFT setting, improving both re-pretrained parallel-decoding OpenVLA-OFT and autoregressive OpenVLA.

Model       | Method   | SR   | Step  | h  | Spd↑
OpenVLA-OFT | Baseline | 98.6 | 111.2 | 8  | 1.00×
OpenVLA-OFT | S1+S2    | 98.8 | 65.4  | 14 | 2.97×
OpenVLA     | Baseline | 84.7 | 113.5 | -  | 1.00×
OpenVLA     | S2       | 87.0 | 80.6  | -  | 1.41×
Qualitative comparison on LIBERO tasks
Figure 3. Qualitative comparison on randomly sampled LIBERO tasks. Under identical configurations, the baseline incurs redundant physical actions while PolicyTrim completes tasks in fewer steps.
Real-World Comparison

Flip Mug

Baseline: success at 14.3s | PolicyTrim: success at 6.5s

Hang Mug

Baseline: success at 15.1s | PolicyTrim: success at 7.4s

Tape Box

Baseline: success at 18.6s | PolicyTrim: success at 7.5s

Real-world rollout comparison under the same task setting. The success marker appears at the completion timestamp encoded in each video filename.

Citation

BibTeX

@inproceedings{policytrim2026,
  title     = {PolicyTrim: Boosting Intrinsic Policy Efficiency of Vision-Language-Action Models},
  author    = {Xianghui Wang and Feng Chen and Wenbo Zhang and Hua Yan and Zixuan Wang and Changsheng Li and Yinjie Lei},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}