CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

📅 2026-04-27
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the poor real-time performance of existing flow-based vision-language-action (VLA) policies, which need multi-step inference to recover action structure from uninformative Gaussian noise and therefore trade efficiency against quality. It proposes CF-VLA, a coarse-to-fine two-stage action generation framework. The coarse stage models a conditional posterior over endpoint velocity to turn Gaussian noise into a structured, action-aware initialization, so that refinement no longer starts from raw noise. A single-step fine refinement stage then corrects residual errors. Training is stepwise: a controlled coarse predictor is learned first, followed by joint optimization. Under low NFE (Number of Function Evaluations), CF-VLA consistently outperforms existing NFE=2 methods on the CALVIN and LIBERO benchmarks and matches or surpasses the NFE=10 $\pi_{0.5}$ baseline on several metrics, while reducing action sampling latency by 75.4%. On real-robot tasks it achieves the best average success rate, 83.0%.
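As a reading aid, the reported 75.4% latency reduction can be restated as a speedup factor. This is plain arithmetic on the figure above; the symbols $T_{\text{base}}$ and $T_{\text{CF-VLA}}$ are introduced here only for illustration, and the baseline is whichever configuration the paper measures against.

$$\frac{T_{\text{base}}}{T_{\text{CF-VLA}}} = \frac{1}{1 - 0.754} = \frac{1}{0.246} \approx 4.07$$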

📝 Abstract
Flow-based vision-language-action (VLA) policies offer strong expressivity for action generation, but suffer from a fundamental inefficiency: multi-step inference is required to recover action structure from uninformative Gaussian noise, leading to a poor efficiency-quality trade-off under real-time constraints. We address this issue by rethinking the role of the starting point in generative action modeling. Instead of shortening the sampling trajectory, we propose CF-VLA, a coarse-to-fine two-stage formulation that restructures action generation into a coarse initialization step that constructs an action-aware starting point, followed by a single-step local refinement that corrects residual errors. Concretely, the coarse stage learns a conditional posterior over endpoint velocity to transform Gaussian noise into a structured initialization, while the fine stage performs a fixed-time refinement from this initialization. To stabilize training, we introduce a stepwise strategy that first learns a controlled coarse predictor and then performs joint optimization. Experiments on CALVIN and LIBERO show that our method establishes a strong efficiency-performance frontier under low-NFE (Number of Function Evaluations) regimes: it consistently outperforms existing NFE=2 methods, matches or surpasses the NFE=10 $\pi_{0.5}$ baseline on several metrics, reduces action sampling latency by 75.4%, and achieves the best average real-robot success rate of 83.0%, outperforming MIP by 19.5 points and $\pi_{0.5}$ by 4.0 points. These results suggest that structured, coarse-to-fine generation enables both strong performance and efficient inference. Our code is available at https://github.com/EmbodiedAI-RoboTron/CF-VLA.
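
The two-stage sampling described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the released CF-VLA code: the function names (`sample_actions`, `toy_coarse`, `toy_fine`, `toy_target`), the value of `t_fixed`, and the update rule that scales the fine velocity by `1 - t_fixed` are all hypothetical. The coarse stage is also simplified to a deterministic map of the noise, whereas the paper learns a conditional posterior over endpoint velocity.

```python
import numpy as np

def sample_actions(coarse_fn, fine_fn, obs, shape, t_fixed=0.8, rng=None):
    """Illustrative coarse-to-fine sampler using two function evaluations."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(shape)  # uninformative Gaussian start
    # Coarse stage (1 NFE): a predicted endpoint velocity moves the noise
    # to a structured, action-aware initialization in one jump.
    init = noise + coarse_fn(noise, obs)
    # Fine stage (1 NFE): a single refinement evaluated at a fixed time.
    # Scaling by (1 - t_fixed) is an assumption made for this sketch.
    refined = init + (1.0 - t_fixed) * fine_fn(init, t_fixed, obs)
    return init, refined

# Toy stand-ins for the learned networks (hypothetical, for illustration only).
def toy_target(obs):
    # Pretend the ideal action chunk repeats the observation for 4 steps.
    return np.tile(obs, (4, 1))

def toy_coarse(noise, obs):
    # Lands at 90% of the target, leaving a residual for the fine stage.
    return 0.9 * toy_target(obs) - noise

def toy_fine(x, t, obs):
    # Velocity that closes the remaining gap over the interval (1 - t).
    return (toy_target(obs) - x) / (1.0 - t)

obs = np.array([0.5, -1.0, 2.0])
init, refined = sample_actions(
    toy_coarse, toy_fine, obs, shape=(4, 3), rng=np.random.default_rng(0)
)
coarse_err = np.abs(init - toy_target(obs)).max()
fine_err = np.abs(refined - toy_target(obs)).max()
```

In this toy setup the coarse output is deliberately left slightly off target so the residual correction by the fine stage is visible in `coarse_err` versus `fine_err`. Each stage calls its network once, which is how the sketch stays at NFE=2.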
Problem

Research questions and friction points this paper is trying to address.

vision-language-action
action generation
flow-based models
inference efficiency
real-time constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

coarse-to-fine generation
vision-language-action
flow-based policy
efficient inference
structured initialization