🤖 AI Summary
This work identifies and formally names “linguistic blindness” in vision-language-action (VLA) models: a failure mode in which models disregard out-of-distribution (OOD) language instructions that conflict with visual priors and default to visually plausible actions instead. To diagnose this language-action decoupling systematically, the authors introduce ICBench, a benchmark for quantifying such grounding failures. They further propose IGAR, a training-free, architecture-agnostic inference-time method that reweights cross-modal attention to realign generated actions with linguistic intent. Experiments show that IGAR substantially reduces erroneous executions under OOD instructions across 30 LIBERO tasks and on a real Franka robotic arm, while preserving performance on in-distribution tasks, thereby strengthening language grounding without compromising general capability.
📝 Abstract
Vision-Language-Action (VLA) models enable robots to perform manipulation tasks directly from natural language instructions and are increasingly viewed as a foundation for generalist robotic policies. However, their reliability under Out-of-Distribution (OOD) instructions remains underexplored. In this paper, we reveal a critical failure mode in which VLA policies continue executing visually plausible actions even when the language instruction contradicts the scene. We refer to this phenomenon as linguistic blindness: VLA policies prioritize visual priors over instruction semantics during action generation. To systematically analyze this issue, we introduce ICBench, a diagnostic benchmark constructed from the LIBERO dataset that probes language-action coupling by injecting controlled OOD instruction contradictions while keeping the visual environment unchanged. Evaluations on three representative VLA architectures, Pi0, Pi0.5, and OpenVLA-OFT, show that these models frequently succeed at tasks despite logically impossible instructions, revealing a strong visual bias in action generation. To mitigate this issue, we propose Instruction-Guided Attention Recalibration (IGAR), a training-free inference-time mechanism that rebalances attention distributions to restore the influence of language instructions. IGAR operates without retraining or architectural modification and can be directly applied to existing VLA models. Experiments across 30 LIBERO tasks demonstrate that IGAR substantially reduces erroneous execution under OOD contradictory instructions while preserving baseline task performance. We additionally validate the approach on a real Franka robotic arm, where IGAR effectively prevents manipulation triggered by inconsistent instructions.
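The abstract describes IGAR only at a high level (rebalancing attention distributions at inference time so language tokens regain influence). As a minimal illustrative sketch, and not the paper's actual mechanism, one simple way to realize this idea is to add a positive bias to the attention logits of instruction tokens before the softmax. All names here (`recalibrate_attention`, `lang_mask`, `delta`) are hypothetical:

```python
import numpy as np

def recalibrate_attention(attn_logits, lang_mask, delta=1.0):
    """Hypothetical sketch of instruction-guided attention recalibration.

    attn_logits: (..., num_keys) pre-softmax attention scores over all
                 visual + language tokens for one query.
    lang_mask:   (num_keys,) boolean mask marking language-instruction tokens.
    delta:       additive boost applied to language-token logits; delta=0
                 recovers the original (unmodified) attention.
    """
    # Boost only the language-token logits; visual tokens are untouched.
    boosted = attn_logits + np.where(lang_mask, delta, 0.0)
    # Numerically stable softmax over the key axis.
    z = boosted - boosted.max(axis=-1, keepdims=True)
    w = np.exp(z)
    return w / w.sum(axis=-1, keepdims=True)
```

Because the bias is additive and applied before normalization, the recalibrated weights still form a valid distribution, and mass shifts monotonically from visual tokens toward instruction tokens as `delta` grows; the paper's actual reweighting rule may differ.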