Do What You Say: Steering Vision-Language-Action Models via Runtime Reasoning-Action Alignment Verification

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-Language-Action (VLA) models often generate semantically correct textual plans but exhibit action-reasoning inconsistency—i.e., low embodied Chain-of-Thought (CoT) faithfulness—under out-of-distribution (OOD) scenarios. Method: We formalize this issue for the first time and propose a training-free, runtime policy-steering approach: leveraging the intrinsic action diversity of VLA models to construct candidate action sequences; simulating their execution outcomes; and using a pretrained vision-language model (VLM) to score reasoning-action consistency and filter candidates. This turns action uncertainty into a search advantage, enabling dynamic alignment between reasoning and behavior. Contribution/Results: Evaluated on the extended LIBERO-100 benchmark under semantic and visual OOD perturbations, our method improves performance on compositional manipulation tasks by up to 15%. It demonstrates strong generalization, zero-shot adaptability, and scalable resource efficiency without requiring additional training or fine-tuning.

📝 Abstract
Reasoning Vision Language Action (VLA) models improve robotic instruction-following by generating step-by-step textual plans before low-level actions, an approach inspired by Chain-of-Thought (CoT) reasoning in language models. Yet even with a correct textual plan, the generated actions can still miss the intended outcomes in the plan, especially in out-of-distribution (OOD) scenarios. We formalize this phenomenon as a lack of embodied CoT faithfulness, and introduce a training-free, runtime policy steering method for reasoning-action alignment. Given a reasoning VLA's intermediate textual plan, our framework samples multiple candidate action sequences from the same model, predicts their outcomes via simulation, and uses a pre-trained Vision-Language Model (VLM) to select the sequence whose outcome best aligns with the VLA's own textual plan. Executing only action sequences that align with the textual reasoning turns our base VLA's natural action diversity from a source of error into a strength, boosting robustness to semantic and visual OOD perturbations and enabling novel behavior composition without costly re-training. We also contribute a reasoning-annotated extension of LIBERO-100 with environment variations tailored for OOD evaluation, and demonstrate up to a 15% performance gain over prior work on behavior composition tasks, with gains that scale with compute and data diversity. Project Website at: https://yilin-wu98.github.io/steering-reasoning-vla/
Problem

Research questions and friction points this paper is trying to address.

Addresses misalignment between textual plans and actions in VLAs
Improves robustness to out-of-distribution scenarios without retraining
Enhances action selection through simulation and vision-language verification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Runtime policy steering for reasoning-action alignment
Simulation-based outcome prediction for action sequences
Vision-Language Model selection of aligned action outcomes
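The runtime steering loop described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all names (`vla`, `simulator`, `vlm_score`, and their methods) are hypothetical placeholders standing in for the reasoning VLA, the outcome-prediction simulator, and the VLM consistency scorer.

```python
def steer(vla, simulator, vlm_score, observation, instruction, n_candidates=8):
    """Pick the candidate action sequence whose simulated outcome
    best matches the VLA's own textual plan (hypothetical API)."""
    # 1. The reasoning VLA first emits an intermediate textual plan.
    plan = vla.generate_plan(observation, instruction)

    # 2. Sample diverse candidate action sequences from the same model,
    #    exploiting its natural action diversity.
    candidates = [vla.sample_actions(observation, plan)
                  for _ in range(n_candidates)]

    # 3. Predict each candidate's outcome via simulation, then score
    #    outcome-plan consistency with a pretrained VLM.
    best_actions, best_score = None, float("-inf")
    for actions in candidates:
        predicted_outcome = simulator.rollout(observation, actions)
        score = vlm_score(predicted_outcome, plan)
        if score > best_score:
            best_actions, best_score = actions, score

    # 4. Execute only the best-aligned sequence.
    return best_actions
```

Because the loop only resamples and filters at inference time, it needs no retraining; scaling `n_candidates` trades compute for robustness, consistent with the reported compute scaling.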