CAST: Counterfactual Labels Improve Instruction Following in Vision-Language-Action Models

📅 2025-08-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language-action (VLA) models struggle to accurately comprehend and execute fine-grained natural language instructions, primarily due to semantically homogeneous language annotations in robotic datasets, low task discriminability, and insufficient fine-grained language grounding for similar visual observations. To address this, we propose a counterfactual relabeling method leveraging pre-trained vision-language models (VLMs), which automatically generates semantically diverse, task-specific counterfactual language descriptions and corresponding action labels—without collecting new data—thereby substantially enriching linguistic variability and task granularity in existing datasets. A VLA policy trained on the relabeled data achieves a 27% improvement in success rate across three indoor and outdoor vision-language navigation benchmarks, setting new state-of-the-art performance. This work introduces, for the first time, counterfactual generation for VLA data augmentation, establishing a low-cost, scalable paradigm for enhancing instruction-following capability.

📝 Abstract
Generalist robots should be able to understand and follow user instructions, but current vision-language-action (VLA) models struggle with following fine-grained commands despite providing a powerful architecture for mapping open-vocabulary natural language instructions to robot actions. One cause for this is a lack of semantic diversity and language grounding in existing robot datasets and, specifically, a lack of fine-grained task diversity for similar observations. To address this, we present a novel method to augment existing robot datasets by leveraging vision-language models to create counterfactual labels. Our method improves the language-following capabilities of VLAs by increasing the diversity and granularity of language grounding for robot datasets through generating counterfactual language and actions. We evaluate the resulting model's ability to follow language instructions, ranging from simple object-centric commands to complex referential tasks, by conducting vision-language navigation experiments in three different indoor and outdoor environments. Our experiments demonstrate that counterfactual relabeling, without any additional data collection, significantly improves instruction following in VLA policies, making them competitive with state-of-the-art methods and increasing success rate by 27% on navigation tasks.
Problem

Research questions and friction points this paper is trying to address.

VLA models struggle with fine-grained instruction following
Lack of semantic diversity in existing robot datasets
Need improved language grounding for complex navigation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging vision language models for counterfactual labels
Augmenting robot datasets without additional data collection
Increasing language diversity and granularity for grounding
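The relabeling idea in the bullets above can be illustrated with a minimal sketch: for one observation, instructions referring to non-target objects are generated (here via a stubbed stand-in for a VLM call) and paired with the corresponding counterfactual action labels. All names here (`counterfactual_relabel`, the trajectory fields, the waypoint actions) are hypothetical illustrations, not the paper's actual code or data format.

```python
# Hypothetical sketch of counterfactual relabeling, not the authors' implementation.
# Assumes each trajectory records the objects visible in its observation and
# an action label (here, a goal waypoint) associated with each object.

def counterfactual_relabel(trajectory, vlm_propose):
    """Generate (instruction, action) pairs for visible objects other than the
    original target, so a single observation yields multiple grounded examples."""
    new_pairs = []
    for obj in trajectory["visible_objects"]:
        if obj == trajectory["target_object"]:
            continue  # keep only counterfactual targets
        # A VLM (stubbed below) rewrites the instruction to refer to `obj`.
        instruction = vlm_propose(trajectory["observation"], obj)
        # The counterfactual action label is the waypoint toward `obj`.
        action = trajectory["object_waypoints"][obj]
        new_pairs.append({"instruction": instruction, "action": action})
    return new_pairs

def fake_vlm(observation, obj):
    # Toy stand-in for a real vision-language model query.
    return f"Go to the {obj}."

traj = {
    "observation": "hallway with a door, a chair, and a plant",
    "target_object": "door",
    "visible_objects": ["door", "chair", "plant"],
    "object_waypoints": {"door": (1, 0), "chair": (0, 1), "plant": (1, 1)},
}
pairs = counterfactual_relabel(traj, fake_vlm)
# Each non-target object yields one new labeled example without new data collection.
```

The point of the sketch is the data-side effect: one collected observation now grounds several semantically distinct instructions, which is what the paper credits for the improved instruction following.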