ECCV 2026

PolicyTrim

Boosting Intrinsic Policy Efficiency of Vision-Language-Action Models

Xianghui Wang¹,* Feng Chen²,* Wenbo Zhang² Hua Yan¹ Zixuan Wang¹,†

Changsheng Li³ Yinjie Lei¹,‡

¹Sichuan University ²The University of Adelaide ³Beijing Institute of Technology

*Co-first authors †Project lead ‡Corresponding author

Intrinsic Policy Inefficiency in Deployed VLA Models

**Figure 1.** Intrinsic policy inefficiency analysis. Repeated rollouts reveal large step-count variance, while forced long chunk execution can reduce success and increase physical steps, exposing unreliable tail predictions and redundant corrective actions.

Abstract

Two-Stage RL Post-Training for Policy Efficiency

Vision-Language-Action (VLA) models provide a unified paradigm for robotic manipulation, yet their real-world deployment is often bottlenecked by execution efficiency. While existing efforts predominantly focus on compute-centric efficiency to reduce per-step inference latency, the intrinsic policy efficiency of these models remains largely unexplored. Policy efficiency is fundamentally affected by two factors: the effective executable length of predicted action chunks and the total physical steps required to complete a task.

We observe that current VLA policies suffer from planning unreliability and action redundancy: predictions degrade severely toward the tail of each action chunk, and the policies tend to generate redundant physical steps. To address this, we propose PolicyTrim, a reinforcement learning-based post-training framework that extends the reliable action chunk length and reduces redundant physical steps via dynamic horizon exploration and a redundancy-aware step-saving reward.

Extensive experiments across three benchmarks and three VLA models demonstrate that PolicyTrim improves action chunk utilization by 3× and reduces physical execution steps by 51.4%. Ultimately, our framework delivers up to a 5.83× end-to-end deployment speedup without compromising task success rates.
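To make the two efficiency factors concrete, the toy loop below queries a policy once per chunk, executes the first `h_exec` actions of each predicted chunk open-loop, and stops when the task reports completion. `ToyEnv`, `toy_policy`, and `run_episode` are hypothetical names used only for illustration; they are not part of PolicyTrim or of any VLA codebase. The sketch only shows that the number of policy queries scales with the physical step count divided by the executed chunk length.

```python
from dataclasses import dataclass


@dataclass
class ToyEnv:
    """Hypothetical 1-D task: finished once `pos` reaches `goal`."""
    goal: int
    pos: int = 0

    def step(self, action: int) -> bool:
        self.pos += action
        return self.pos >= self.goal  # True once the task is complete


def toy_policy(obs: int, chunk_size: int) -> list[int]:
    """Stand-in for one VLA forward pass: returns a chunk of unit actions."""
    return [1] * chunk_size


def run_episode(env: ToyEnv, h_exec: int, chunk_size: int = 16, max_steps: int = 1000):
    """Execute the first `h_exec` actions of every predicted chunk.

    Returns (physical steps taken, number of policy queries).
    """
    steps = queries = 0
    done = False
    while not done and steps < max_steps:
        chunk = toy_policy(env.pos, chunk_size)
        queries += 1
        for action in chunk[:h_exec]:
            done = env.step(action)
            steps += 1
            if done or steps >= max_steps:
                break
    return steps, queries


steps_short, queries_short = run_episode(ToyEnv(goal=60), h_exec=5)
steps_long, queries_long = run_episode(ToyEnv(goal=60), h_exec=15)
print(steps_short, queries_short)  # 60 12
print(steps_long, queries_long)    # 60 4
```

In this toy setting the longer execution horizon cuts policy queries from 12 to 4 at the same step count. A real policy only benefits in this way if the tail of each chunk is reliable enough to execute.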

3× action chunk utilization improvement

51.4% physical execution step reduction

5.83× end-to-end deployment speedup

Method

Two-Stage RL Post-Training Framework

Stage 1

Reliable Action Chunk Extension

A progressive reliability sweep rewards successful rollouts that sustain longer executable action chunks, pushing the trustworthy prediction frontier toward the usable chunk limit.
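This page does not give the Stage 1 reward in closed form. The sketch below is one possible reading of the description, not the authors' formulation: an execution horizon is sampled from a range whose upper bound widens as training progresses, and only successful rollouts are rewarded, in proportion to the fraction of the chunk they executed. The function names, the linear schedule, and the linear reward are all assumptions made for illustration.

```python
import random


def sample_horizon(progress: float, h_min: int, h_max: int, rng: random.Random) -> int:
    """Sample an execution horizon; the upper bound grows with progress in [0, 1].

    The linear schedule is an assumption, not taken from the paper.
    """
    progress = min(max(progress, 0.0), 1.0)
    upper = h_min + round(progress * (h_max - h_min))
    return rng.randint(h_min, upper)  # inclusive on both ends


def chunk_extension_reward(success: bool, h_exec: int, h_max: int) -> float:
    """Hypothetical reward: zero on failure, otherwise the executed chunk fraction."""
    return h_exec / h_max if success else 0.0


rng = random.Random(0)
for progress in (0.0, 0.5, 1.0):
    h = sample_horizon(progress, h_min=5, h_max=15, rng=rng)
    print(progress, h, chunk_extension_reward(True, h, h_max=15))
```

Under this reading, a failed rollout earns nothing regardless of horizon, so the policy is only pushed toward longer chunks it can actually complete.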

Stage 2

Redundancy-Aware Step Reduction

A step-saving reward favors successful task completions with fewer physical steps, while stability regularization discourages unreproducible shortcuts.
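The exact Stage 2 reward and regularizer are likewise not stated on this page. As a rough illustration only, the sketch below gives successful rollouts a bonus proportional to the steps saved relative to a reference step count, and scales that bonus by the success rate of a group of repeated rollouts so that a shortcut that works only occasionally earns little. `step_saving_reward`, `stabilized_rewards`, `ref_steps`, and `alpha` are hypothetical, and the group-success scaling is a stand-in for the stability regularization, not the authors' method.

```python
from statistics import mean


def step_saving_reward(success: bool, steps: int, ref_steps: float, alpha: float = 1.0) -> float:
    """Hypothetical reward: 0 on failure, else 1 plus a bonus for steps saved."""
    if not success:
        return 0.0
    saving = max(0.0, (ref_steps - steps) / ref_steps)
    return 1.0 + alpha * saving


def stabilized_rewards(rollouts, ref_steps: float, alpha: float = 1.0):
    """Score repeated rollouts of one task, given as (success, steps) pairs.

    The step-saving bonus is scaled by the group's success rate, so an
    unreproducible shortcut receives a smaller bonus (an assumed regularizer).
    """
    success_rate = mean(1.0 if ok else 0.0 for ok, _ in rollouts)
    return [step_saving_reward(ok, n, ref_steps, alpha * success_rate) for ok, n in rollouts]


reliable = [(True, 60), (True, 62), (True, 61)]
flaky = [(True, 40), (False, 40), (False, 45)]
print(stabilized_rewards(reliable, ref_steps=108.0))
print(stabilized_rewards(flaky, ref_steps=108.0))
```

In this example the flaky group's one success uses fewer steps than any reliable rollout, yet it scores lower because the group succeeds only a third of the time.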

Experiments

Benchmark Results

LIBERO Benchmark

Average success rate (SR, %), total physical steps (S_total), average executed action chunk length (h_chunk), and end-to-end speedup over the baseline (Spd↑) on the four LIBERO task suites.

Task	Method	π₀.₅				OpenVLA-OFT				GR00T
Task	Method	SR	S_total	h_chunk	Spd↑	SR	S_total	h_chunk	Spd↑	SR	S_total	h_chunk	Spd↑
Spatial	Baseline	97.8	108.3	5	1.00×	98.6	111.2	8	1.00×	91.4	67.2	5	1.00×
Spatial	PolicyTrim	97.8	59.8	15	5.43×	98.8	62.1	8	1.79×	92.0	56.6	10	2.37×
Object	Baseline	99.1	125.0	5	1.00×	98.5	135.2	8	1.00×	95.0	71.3	5	1.00×
Object	PolicyTrim	98.5	64.3	15	5.83×	98.5	68.8	8	1.97×	95.3	65.5	10	2.18×
Goal	Baseline	98.7	110.6	5	1.00×	97.7	118.6	8	1.00×	84.2	63.3	5	1.00×
Goal	PolicyTrim	98.8	63.5	15	5.23×	98.0	66.9	8	1.77×	86.3	60.8	10	2.08×
Long	Baseline	93.0	249.8	5	1.00×	92.9	249.3	8	1.00×	86.1	177.9	5	1.00×
Long	PolicyTrim	93.3	171.8	10	2.91×	93.1	178.3	8	1.40×	89.2	165.9	10	2.14×
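This page does not spell out how Spd is computed. One reading that matches the LIBERO entries above to two decimals is the ratio of policy inference calls, S_total / h_chunk, between the baseline and PolicyTrim. The sketch below checks that reading against a few rows of the table. The formula is inferred from the reported numbers and is not a definition taken from the paper; `call_ratio_speedup` is a hypothetical helper.

```python
def call_ratio_speedup(base_steps: float, base_h: int, new_steps: float, new_h: int) -> float:
    """Ratio of approximate inference calls (steps / chunk length), baseline over new."""
    return (base_steps / base_h) / (new_steps / new_h)


# (baseline S_total, baseline h_chunk, PolicyTrim S_total, PolicyTrim h_chunk, reported Spd)
rows = {
    "Spatial, π₀.₅": (108.3, 5, 59.8, 15, 5.43),
    "Object, π₀.₅": (125.0, 5, 64.3, 15, 5.83),
    "Long, π₀.₅": (249.8, 5, 171.8, 10, 2.91),
    "Spatial, OpenVLA-OFT": (111.2, 8, 62.1, 8, 1.79),
    "Spatial, GR00T": (67.2, 5, 56.6, 10, 2.37),
}

estimates = {}
for name, (bs, bh, ns, nh, reported) in rows.items():
    estimates[name] = call_ratio_speedup(bs, bh, ns, nh)
    print(f"{name}: estimated {estimates[name]:.2f}x, reported {reported:.2f}x")
```

Read this way, the speedup separates into a step-reduction factor and a chunk-length factor. For Spatial with π₀.₅, that is roughly 1.81 from fewer steps times 3 from the longer executed chunk.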

ManiSkill & Meta-World

PolicyTrim improves both success rates and step efficiency across benchmarks, reaching up to 2.52× speedup on Meta-World and 2.36× on ManiSkill.

Benchmark	Method	π₀.₅				OpenVLA-OFT
Benchmark	Method	SR	S_total	h_chunk	Spd↑	SR	S_total	h_chunk	Spd↑
ManiSkill	Baseline	88.1	45.2	5	1.00×	60.6	53.1	8	1.00×
ManiSkill	PolicyTrim	89.8	38.3	10	2.36×	63.2	46.7	8	1.14×
Meta-World	Baseline	65.1	66.3	5	1.00×	not evaluated
Meta-World	PolicyTrim	65.4	52.6	10	2.52×	not evaluated

Real-World Deployment

PolicyTrim transfers its efficiency gains to physical deployment on an Agilex Piper arm, maintaining or improving success rates while achieving 1.86× average wall-clock speedup under the standard real-world setting.

Method	Std. SR (%)			Dyn. SR (%)		Time (s)
Method	Flip	Hang	Tape	Flip	Tape	Flip	Hang	Tape
Baseline	70	60	95	70	65	14.6	15.6	17.5
PolicyTrim	75	65	95	70	70	7.6	8.7	9.4
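As a quick arithmetic check, the reported 1.86× average wall-clock speedup can be recovered from the Time columns above. The sketch assumes the times are in seconds and that the average is the mean of the per-task ratios; the ratio of summed times (47.7 / 25.7) rounds to the same value, so the table alone does not show which aggregation was used.

```python
from statistics import mean

# (baseline time, PolicyTrim time) per task, copied from the table above
times = {"Flip": (14.6, 7.6), "Hang": (15.6, 8.7), "Tape": (17.5, 9.4)}

per_task = {task: base / trimmed for task, (base, trimmed) in times.items()}
avg_speedup = mean(per_task.values())

for task, ratio in per_task.items():
    print(f"{task}: {ratio:.2f}x")
print(f"average: {avg_speedup:.2f}x")  # average: 1.86x
```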

Architectural Generality

PolicyTrim also generalizes beyond the standard OpenVLA-OFT setting, improving both a re-pretrained parallel-decoding OpenVLA-OFT and the autoregressive OpenVLA. In the table, S1 and S2 denote Stage 1 and Stage 2 of the framework.

Model	Method	SR	S_total	h_chunk	Spd↑
OpenVLA-OFT	Baseline	98.6	111.2	8	1.00×
OpenVLA-OFT	S1+S2	98.8	65.4	14	2.97×
OpenVLA	Baseline	84.7	113.5	—	1.00×
OpenVLA	S2	87.0	80.6	—	1.41×

**Figure 3.** Qualitative comparison on randomly sampled LIBERO tasks. Under identical configurations, the baseline incurs redundant physical actions while PolicyTrim completes tasks in fewer steps.

Real-World Comparison

Flip Mug: Baseline 14.3 s | PolicyTrim 6.5 s

Hang Mug: Baseline 15.1 s | PolicyTrim 7.4 s

Tape Box: Baseline 18.6 s | PolicyTrim 7.5 s

Real-world rollout comparison under the same task setting. The success marker appears at the completion timestamp encoded in each video filename.

Citation

BibTeX

@inproceedings{policytrim2026,
  title     = {PolicyTrim: Boosting Intrinsic Policy Efficiency of Vision-Language-Action Models},
  author    = {Xianghui Wang and Feng Chen and Wenbo Zhang and Hua Yan and Zixuan Wang and Changsheng Li and Yinjie Lei},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}