ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models

📅 2026-04-10
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the tendency of existing vision-language-action (VLA) models to overlook language instructions, rely on visual shortcuts, and remain insensitive to linguistic variation and ambiguity. The authors propose ProGAL-VLA, which builds an entity-centric 3D graph (GSM) and pairs it with a slow symbolic planner that generates interpretable subgoals. A Grounding Alignment Contrastive (GAC) loss uses entity-level InfoNCE constraints to align linguistic subgoals with grounded entities. A verification bottleneck on the goal embedding increases the mutual information between language and actions, and a selective prediction mechanism driven by attention entropy lets the model recognize intrinsic ambiguity. On LIBERO-Plus, the method improves robustness under robot perturbations from 30.3% to 71.5%, reduces language ignorance by 3–4×, and raises entity-retrieval Recall@1 from 0.41 to 0.71. On an ambiguity benchmark it attains an AUROC of 0.81, and the clarification rate on ambiguous inputs rises from 0.09 to 0.81.
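
The entropy-driven selective prediction described above can be illustrated with a minimal sketch. It assumes the attention weights over candidate entities are available as a list of non-negative numbers; the function names, the normalization by log(n), and the 0.5 threshold are hypothetical choices for illustration and are not taken from the paper.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of attention weights after normalizing them to sum to 1."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_clarify(weights, threshold=0.5):
    """Hypothetical rule: ask for clarification when normalized entropy exceeds the threshold.

    Entropy is divided by log(n), its maximum over n candidates, so the score lies in [0, 1].
    """
    n = len(weights)
    if n < 2:
        return False  # a single candidate cannot be ambiguous
    normalized = attention_entropy(weights) / math.log(n)
    return normalized > threshold

# Attention peaked on one entity: low entropy, act on the instruction.
peaked = should_clarify([0.97, 0.01, 0.01, 0.01])
# Attention spread evenly over four entities: high entropy, ask for clarification.
spread = should_clarify([0.25, 0.25, 0.25, 0.25])
```

In this sketch a peaked distribution means the instruction picks out one entity, while a flat distribution suggests several entities match equally well, which is the situation in which a clarification request would be reasonable.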

📝 Abstract
Vision-language-action (VLA) models enable generalist robotic agents but often exhibit language ignorance: they rely on visual shortcuts and remain insensitive to instruction changes. We present Prospective Grounding and Alignment VLA (ProGAL-VLA), which constructs a 3D entity-centric graph (GSM), uses a slow planner to produce symbolic sub-goals, and aligns them with grounded entities via a Grounding Alignment Contrastive (GAC) loss. All actions are conditioned on a verified goal embedding $g_t$, whose attention entropy provides an intrinsic ambiguity signal. On LIBERO-Plus, ProGAL-VLA increases robustness under robot perturbations from 30.3 to 71.5 percent, reduces language ignorance by 3x to 4x, and improves entity retrieval from 0.41 to 0.71 Recall@1. On the Custom Ambiguity Benchmark, it reaches AUROC 0.81 (vs. 0.52) and AUPR 0.79, and raises clarification on ambiguous inputs from 0.09 to 0.81 without harming success on unambiguous inputs. The verification bottleneck increases the mutual information between language and actions, the GAC loss imposes an entity-level InfoNCE bound, and attention entropy yields calibrated selective prediction, indicating that explicit verified grounding is an effective path toward instruction-sensitive, ambiguity-aware agents.
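
To make the entity-level InfoNCE idea concrete, here is a minimal sketch of a generic InfoNCE loss between one sub-goal embedding and a set of entity embeddings. The abstract does not give the exact form of the GAC loss, so the cosine similarity, the temperature of 0.1, and the function names are all assumptions for illustration only.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def entity_info_nce(subgoal, entities, positive_index, temperature=0.1):
    """Generic InfoNCE loss: negative log-softmax of the positive entity's similarity.

    subgoal: embedding of the linguistic sub-goal (assumed given).
    entities: embeddings of the candidate grounded entities (assumed given).
    positive_index: index of the entity the sub-goal refers to.
    """
    q = normalize(subgoal)
    logits = [dot(q, normalize(e)) / temperature for e in entities]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[positive_index]

# Toy 2-D embeddings: the sub-goal points toward entity 0 and away from entity 2.
subgoal = [1.0, 0.0]
entities = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
loss_aligned = entity_info_nce(subgoal, entities, positive_index=0)
loss_misaligned = entity_info_nce(subgoal, entities, positive_index=2)
```

The loss is small when the sub-goal embedding is closest to its target entity and large when another entity is closer. With N candidate entities, the standard InfoNCE result gives log N minus the expected loss as a lower bound on the mutual information between the paired embeddings.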
Problem

Research questions and friction points this paper is trying to address.

vision-language-action
language ignorance
instruction sensitivity
ambiguity awareness
grounded alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prospective Reasoning
Grounded Alignment
Vision-Language-Action Models
Contrastive Learning
Ambiguity Awareness
🔎 Similar Papers