Grounding Hierarchical Vision-Language-Action Models Through Explicit Language-Action Alignment

📅 2026-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the misalignment between language and action in existing hierarchical vision-language-action (VLA) models: because linguistic descriptions are not explicitly grounded in motion trajectories, transparency in human-robot collaboration suffers. To resolve this, the authors propose a training framework that introduces, for the first time in hierarchical VLA architectures, an explicit language-action alignment mechanism. The approach uses contrastive learning to score language-action consistency and offline preference learning to optimize joint multimodal grounding. Evaluated on the LanguageTable benchmark, the method matches the performance of fully supervised fine-tuning while substantially reducing the need for finely annotated data, and it reveals key characteristics of effective grounded multimodal representations.
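
To make the contrastive alignment component concrete, the sketch below pairs a toy language encoder with a toy trajectory encoder and trains them with a CLIP-style InfoNCE objective over matched language-trajectory pairs. This is a minimal sketch under assumed design choices: the `LanguageActionAligner` name, the encoder architectures, the embedding dimensions, and the 2-D action space (LanguageTable involves planar end-effector motion) are all illustrative, not the paper's actual implementation.

```python
# Hypothetical sketch of a contrastive language-action alignment model.
# Module names, encoders, and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageActionAligner(nn.Module):
    def __init__(self, vocab_size=10_000, action_dim=2, embed_dim=256):
        super().__init__()
        # Toy language encoder: token embedding + mean pooling.
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        # Toy trajectory encoder: GRU over planar end-effector actions.
        self.traj_encoder = nn.GRU(action_dim, embed_dim, batch_first=True)
        self.logit_scale = nn.Parameter(torch.tensor(2.0))

    def forward(self, tokens, trajectories):
        # tokens: (B, T_text) int64; trajectories: (B, T_act, action_dim)
        lang = self.token_embed(tokens).mean(dim=1)   # (B, D)
        _, h = self.traj_encoder(trajectories)        # h: (1, B, D)
        act = h.squeeze(0)                            # (B, D)
        lang = F.normalize(lang, dim=-1)
        act = F.normalize(act, dim=-1)
        # Pairwise cosine similarities, scaled by a learned temperature.
        return self.logit_scale.exp() * lang @ act.t()  # (B, B)

def info_nce_loss(logits):
    # Matched language-trajectory pairs sit on the diagonal; the symmetric
    # cross-entropy pulls matched pairs together, mismatched pairs apart.
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage: a dummy batch of 4 language-annotated trajectories.
model = LanguageActionAligner()
tokens = torch.randint(0, 10_000, (4, 12))
trajs = torch.randn(4, 50, 2)
loss = info_nce_loss(model(tokens, trajs))
loss.backward()
```

Once trained, the similarity matrix doubles as a scoring function: a higher entry means the contrastive model judges that description and that trajectory to be better aligned, which is what enables the ranking step described in the abstract below.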
📝 Abstract
Achieving robot transparency is a critical step toward effective human-robot collaboration. To be transparent, a robot's natural language communication must be consistent with its actions and explicitly grounded in the task and environment. Existing hierarchical Vision-Language-Action (VLA) models can generate language (e.g., through chain-of-thought) and low-level actions. However, current work does not consider explicit alignment between these modalities during training. To address this crucial gap, we propose a novel training framework that explicitly grounds hierarchical VLA sub-task descriptions with respect to the visual observation and action space. Our framework uses a contrastive model to assess the alignment between generated language and corresponding action trajectories. This contrastive model enables direct ranking of different language-trajectory pairs based on their alignment, allowing us to refine the grounding of our hierarchical VLA through offline preference learning. We apply our framework to the LanguageTable dataset, a benchmark dataset of human language-annotated trajectories, and provide critical insights into multimodal grounding representations, all while establishing a strong baseline that achieves performance comparable to fully supervised fine-tuning and minimizing the need for costly data annotations.
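
The ranking-then-refinement step from the abstract can be pictured with a small offline preference objective. The sketch below uses a DPO-style loss, one common offline preference-learning method; the paper does not specify its exact objective here, so `dpo_loss`, its inputs (sequence log-likelihoods of the preferred and dispreferred sub-task descriptions under the trained policy and a frozen reference copy), and the `rank_pairs` helper are assumptions for illustration.

```python
# Minimal sketch: refine the hierarchical VLA's language head with an
# offline, DPO-style preference loss, with preferences derived from the
# contrastive model's alignment scores. Illustrative, not the paper's
# actual objective.
import torch
import torch.nn.functional as F

def rank_pairs(align_scores_a, align_scores_b):
    # True where candidate description A is better aligned with the
    # trajectory than candidate B, according to the contrastive model.
    return align_scores_a > align_scores_b

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # "_w" = preferred (better-aligned) description, "_l" = dispreferred.
    # Inputs are per-example sequence log-likelihoods under the policy
    # being trained and a frozen reference copy of it.
    ratio_w = policy_logp_w - ref_logp_w
    ratio_l = policy_logp_l - ref_logp_l
    return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()

# Usage with dummy log-likelihoods for a batch of 8 ranked pairs:
pw = torch.randn(8, requires_grad=True)  # stand-in for policy log-probs
pl = torch.randn(8, requires_grad=True)
rw, rl = torch.randn(8), torch.randn(8)  # frozen reference log-probs
dpo_loss(pw, pl, rw, rl).backward()
```

Because the preferences are produced by the contrastive model rather than by human raters, this step can run entirely offline on the existing LanguageTable trajectories, which is what lets the method reduce its dependence on costly annotation.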
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models
language-action alignment
robot transparency
multimodal grounding
hierarchical reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

explicit language-action alignment
hierarchical Vision-Language-Action models
contrastive grounding
offline preference learning
multimodal representation