CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors

📅 2026-04-22

🤖 AI Summary
This work addresses a limitation of existing vision-language-action (VLA) models: spatial guidance is usually injected implicitly through latent features, which does not ensure that generated actions are physically plausible. The authors propose an explicit constraint built from sparse spatial anchors. The model predicts anchors as incremental physical changes (for example, Δ-positions), the anchors define a tolerance corridor, and trajectories whose implied spatial evolution leaves the corridor receive corrective gradients in the flow-matching action head, while small deviations inside it are permitted. This gives interpretable, physically aligned guidance of action trajectories. The work is presented as the first to incorporate explicit spatial corridors into generative action policies. On the LIBERO-Plus benchmark, the method improves success rate by 3.4% to 12.4% over the corresponding SmolVLA and GR00T baselines, and the GR00T-Corr variant attains a success rate of 83.21%.

📝 Abstract
Vision-Language-Action (VLA) models often use intermediate representations to connect multimodal inputs with continuous control, yet spatial guidance is often injected implicitly through latent features. We propose CorridorVLA, which predicts sparse spatial anchors as incremental physical changes (e.g., $\Delta$-positions) and uses them to impose an explicit tolerance region in the training objective for action generation. The anchors define a corridor that guides a flow-matching action head: trajectories whose implied spatial evolution falls outside it receive corrective gradients, while minor deviations from contacts and execution noise are permitted. On the more challenging LIBERO-Plus benchmark, CorridorVLA yields consistent gains across both SmolVLA and GR00T, improving success rate by $3.4\%$ to $12.4\%$ over the corresponding baselines; notably, our GR00T-Corr variant reaches a success rate of $83.21\%$. These results indicate that action-aligned physical cues can provide direct and interpretable constraints for generative action policies, complementing spatial guidance encoded in visual or latent forms. Code is available at https://github.com/corridorVLA.
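The corridor idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract only states that trajectories outside a tolerance region receive corrective gradients and that minor deviations are permitted. The squared-hinge form, the function names `cumulative_positions` and `corridor_penalty`, the `tol` parameter, and the representation of anchors as a step-indexed dictionary are all assumptions made for illustration.

```python
import math


def cumulative_positions(deltas):
    """Integrate per-step (dx, dy, dz) deltas into cumulative displacements."""
    pos = [0.0, 0.0, 0.0]
    out = []
    for d in deltas:
        pos = [p + di for p, di in zip(pos, d)]
        out.append(tuple(pos))
    return out


def corridor_penalty(deltas, anchors, tol):
    """Mean squared hinge on the distance to each sparse anchor beyond `tol`.

    deltas:  list of per-step (dx, dy, dz) actions (hypothetical action format).
    anchors: dict mapping a step index to the predicted cumulative displacement
             at that step (assumed anchor representation).
    tol:     corridor half-width; deviations up to `tol` cost nothing.
    """
    traj = cumulative_positions(deltas)
    total = 0.0
    for step, target in anchors.items():
        dist = math.dist(traj[step], target)
        excess = max(0.0, dist - tol)  # zero inside the corridor
        total += excess ** 2           # grows only once the corridor is left
    return total / len(anchors)


# A trajectory that reaches the anchor stays inside the corridor.
inside = corridor_penalty([(0.01, 0.0, 0.0)] * 4, {3: (0.04, 0.0, 0.0)}, 0.005)

# A trajectory that never moves ends up outside it and is penalized.
outside = corridor_penalty([(0.0, 0.0, 0.0)] * 4, {3: (0.04, 0.0, 0.0)}, 0.01)
```

Under these assumptions, a training objective would add a weighted version of this penalty to the usual flow-matching loss. Because the hinge is flat inside the tolerance region, small deviations from contacts or execution noise contribute no gradient, which matches the behavior the abstract describes.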
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
spatial constraints
generative action
explicit guidance
continuous control
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial anchors
explicit spatial constraints
generative action policy
flow-matching
vision-language-action