GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

📅 2026-05-12

🤖 AI Summary
Existing vision-language-action (VLA) models tend to overfit to spurious visual cues in the absence of explicit guidance, limiting their generalization capabilities. This work proposes GuidedVLA, a novel framework that explicitly decouples the action decoder into multiple plug-and-play, specialized attention heads and introduces handcrafted auxiliary supervision signals to guide each head toward task-critical factors such as object localization, spatial geometry, and temporal skill logic. By promoting attention head specialization within an end-to-end robotic learning pipeline, GuidedVLA substantially outperforms strong baselines in both simulation and real-world environments, achieving higher success rates on both in-distribution and out-of-distribution tasks. Moreover, the quality of each specialized factor is positively correlated with overall task performance.
📝 Abstract
Vision-Language-Action (VLA) models aim for general robot learning by aligning action as a modality within powerful Vision-Language Models (VLMs). Existing VLAs rely on end-to-end supervision to implicitly enable the action decoding process to learn task-relevant features. However, without explicit guidance, these models often overfit to spurious correlations, such as visual shortcuts or environmental noise, limiting their generalization. In this paper, we introduce GuidedVLA, a framework designed to manually guide the action generation to focus on task-relevant factors. Our core insight is to treat the action decoder not as a monolithic learner, but as an assembly of functional components. Individual attention heads are supervised by manually defined auxiliary signals to capture distinct factors. As an initial study, we instantiate this paradigm with three specialized heads: object grounding, spatial geometry, and temporal skill logic. Across simulation and real-robot experiments, GuidedVLA improves success rates in both in-domain and out-of-domain settings compared to strong VLA baselines. Finally, we show that the quality of these specialized factors correlates positively with task performance and that our mechanism yields decoupled, high-quality features. Our results suggest that explicitly guiding action-decoder learning is a promising direction for building more robust and general VLA models.
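The idea of supervising individual attention heads with auxiliary signals can be illustrated with a minimal sketch. The abstract does not state the loss form, so everything here is an assumption made for illustration: the cross-entropy guidance term, the names `attention_guidance_loss` and `guided_loss`, the `aux_weight` coefficient, and the toy masks are all hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_guidance_loss(head_scores, target_mask, eps=1e-8):
    """Cross-entropy between one head's attention and a normalized target mask.

    head_scores: raw attention logits of ONE head over N context tokens.
    target_mask: non-negative weights marking task-relevant tokens
                 (a hypothetical auxiliary label, e.g. an object mask).
    """
    attn = softmax(head_scores)
    mass = sum(target_mask)
    if mass == 0:
        return 0.0  # no auxiliary label for this sample, so no guidance
    target = [t / mass for t in target_mask]
    return -sum(t * math.log(a + eps) for t, a in zip(target, attn))


def guided_loss(action_loss, heads, aux_weight=0.1):
    """Action loss plus a weighted sum of per-head guidance losses.

    heads: dict mapping a factor name to (head_scores, target_mask).
    aux_weight: assumed trade-off coefficient, not a value from the paper.
    """
    aux = {name: attention_guidance_loss(s, m) for name, (s, m) in heads.items()}
    return action_loss + aux_weight * sum(aux.values()), aux


# Toy example: two supervised heads over four context tokens.
heads = {
    "object_grounding": ([2.0, 0.1, 0.1, 0.1], [1, 0, 0, 0]),
    "spatial_geometry": ([0.0, 0.0, 0.0, 0.0], [0, 1, 1, 0]),
}
total, aux = guided_loss(action_loss=0.5, heads=heads)
```

In this sketch each named head receives its own target, so a head whose attention already concentrates on the labeled tokens contributes a small penalty, while a head that attends elsewhere is pushed toward its assigned factor. A third head for temporal skill logic would be one more entry in the `heads` dict.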
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
action decoding
spurious correlations
generalization
task-relevant factors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action
action attention specialization
task-relevant guidance
modular decoder
auxiliary supervision
Authors
Xiaosong Jia
Assistant Professor, Institute of Trustworthy Embodied AI (TEAI), Fudan University
Embodied AI, Autonomous Driving, World Model, Reinforcement Learning
Bowen Yang
Shanghai Jiao Tong University
Zuhao Ge
Institute of Trustworthy Embodied AI (TEAI), Fudan University; Shanghai Key Laboratory of Multimodal Embodied AI
Xian Nie
Shanghai Jiao Tong University
Yuchen Zhou
Institute of Trustworthy Embodied AI (TEAI), Fudan University; Shanghai Key Laboratory of Multimodal Embodied AI
Cunxin Fan
Shanghai Jiao Tong University
Yufeng Li
East China Normal University
Artificial Intelligence
Yilin Chai
Shanghai Jiao Tong University
Chao Jing
Institute of Trustworthy Embodied AI (TEAI), Fudan University; Shanghai Key Laboratory of Multimodal Embodied AI
Zijian Liang
Shanghai Jiao Tong University
Qingwen Bu
HKU | OpenDriveLab
Robot Learning, Computer Vision, Machine Learning
Haidong Cao
Institute of Trustworthy Embodied AI (TEAI), Fudan University; Shanghai Key Laboratory of Multimodal Embodied AI
Chao Wu
Institute of Trustworthy Embodied AI (TEAI), Fudan University; Shanghai Key Laboratory of Multimodal Embodied AI
Qifeng Li
University of Central Florida
Convex Optimization, Nonlinear Systems, Electrical and Energy Systems, Smart grid technologies
Zhenjie Yang
Tsinghua University
Networking
Chenhe Zhang
Institute of Trustworthy Embodied AI (TEAI), Fudan University; Shanghai Key Laboratory of Multimodal Embodied AI
Hongyang Li
Assistant Professor, University of Hong Kong
Computer Vision, Autonomous Driving, Robotics
Zuxuan Wu
Fudan University
Junchi Yan
FIAPR & ICML Board Member, SJTU (2018-), SII (2024-), AWS (2019-2022), IBM (2011-2018)
Computational Intelligence, AI4Science, Machine Learning, Autonomous Driving
Yu-Gang Jiang
Professor, Fudan University. IEEE & IAPR Fellow
Video Analysis, Embodied AI, Trustworthy AI