When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs

📅 2026-02-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the susceptibility of Vision-Language-Action (VLA) models to dataset biases: when an instruction lacks strong scene-specific supervision, the model often relies on visual shortcuts and ignores the language, so it fails in counterfactual scenarios. To mitigate this, the authors propose Counterfactual Action Guidance (CAG), a plug-and-play dual-branch inference scheme that strengthens language adherence without additional demonstrations or changes to the pretrained model. During inference, CAG explicitly contrasts the standard VLA policy with a language-agnostic Vision-Action (VA) policy. The study also introduces LIBERO-CF, the first counterfactual evaluation benchmark for VLAs. On under-observed tasks in LIBERO-CF, the training-free variant of CAG improves the language-following accuracy of $\pi_{0.5}$ by 9.7% and its task success rate by 3.6%; when paired with a VA model, the gains rise to 15.5% and 8.5%, respectively. In real-world evaluations, CAG reduces counterfactual failures by 9.4% and raises task success by 17.2% on average.

📝 Abstract

Vision-Language-Action models (VLAs) promise to ground language instructions in robot control, yet in practice they often fail to follow language faithfully. When presented with instructions that lack strong scene-specific supervision, VLAs suffer from counterfactual failures: they act on visual shortcuts induced by dataset biases, repeatedly executing well-learned behaviors and selecting objects frequently seen during training regardless of language intent. To study this problem systematically, we introduce LIBERO-CF, the first counterfactual benchmark for VLAs, which evaluates language-following capability by assigning alternative instructions under visually plausible LIBERO layouts. Our evaluation reveals that counterfactual failures are prevalent yet underexplored across state-of-the-art VLAs. We propose Counterfactual Action Guidance (CAG), a simple yet effective dual-branch inference scheme that explicitly regularizes language conditioning in VLAs. CAG combines a standard VLA policy with a language-unconditioned Vision-Action (VA) module, enabling counterfactual comparison during action selection. This design reduces reliance on visual shortcuts, improves robustness on under-observed tasks, and requires neither additional demonstrations nor modifications to existing architectures or pretrained models. Extensive experiments demonstrate plug-and-play integration across diverse VLAs and consistent improvements. For example, on LIBERO-CF, CAG improves $\pi_{0.5}$ by 9.7% in language-following accuracy and 3.6% in task success on under-observed tasks using a training-free strategy, with further gains of 15.5% and 8.5%, respectively, when paired with a VA model. In real-world evaluations, CAG reduces counterfactual failures by 9.4% and improves task success by 17.2% on average.
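
The abstract describes CAG as contrasting a language-conditioned VLA prediction with a language-unconditioned VA prediction during action selection, but it does not give the combination rule. The sketch below assumes a classifier-free-guidance-style extrapolation, which may differ from the paper's actual formulation. The function name `counterfactual_action_guidance`, the `guidance_scale` parameter, and the toy action vectors are all hypothetical and serve only to illustrate the idea.

```python
import numpy as np


def counterfactual_action_guidance(vla_action, va_action, guidance_scale=1.5):
    """Combine two action predictions for the same observation.

    ASSUMPTION: a classifier-free-guidance-style rule, not confirmed by the
    source. The VA prediction stands for what vision alone suggests; the
    difference (vla_action - va_action) is the part attributable to language.
    guidance_scale = 1.0 returns the plain VLA action; values above 1.0
    amplify the language-attributable part.
    """
    vla_action = np.asarray(vla_action, dtype=float)
    va_action = np.asarray(va_action, dtype=float)
    if vla_action.shape != va_action.shape:
        raise ValueError("VLA and VA actions must have the same shape")
    return va_action + guidance_scale * (vla_action - va_action)


# Toy 2-D actions (hypothetical values, not taken from the paper).
vla = [0.2, 0.0]  # prediction conditioned on image and instruction
va = [0.5, 0.0]   # prediction conditioned on image only
guided = counterfactual_action_guidance(vla, va, guidance_scale=2.0)
```

Under this assumed rule, the guided action moves further away from the vision-only prediction along the direction the instruction induces, which is one way to read "counterfactual comparison during action selection".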

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models

counterfactual failures

language grounding

visual shortcuts

instruction following

Innovation

Methods, ideas, or system contributions that make the work stand out.

Counterfactual Failure

Vision-Language-Action Models

Language Grounding