AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention

📅 2025-11-24
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Current Vision-Language-Action (VLA) models process temporal visual inputs independently, implicitly assuming Markovian dynamics and neglecting historical context, which leads to suboptimal utilization of visual tokens in dynamic sequential decision-making. To address this, we reformulate VLA modeling from a Partially Observable Markov Decision Process (POMDP) perspective and propose an active visual attention mechanism: a recurrent network approximates the belief state, and visual token selection is dynamically modulated by this history-conditioned belief, enabling context-aware soft attention. Our method integrates a vision-language backbone with recurrent state modeling. It achieves state-of-the-art performance on the LIBERO and CALVIN benchmarks and demonstrates strong sim-to-real transfer on a dual-arm robotic platform. The core contribution lies in embedding belief-driven temporal awareness into the VLA architecture, substantially enhancing long-horizon visual reasoning for embodied agents.
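To make the mechanism concrete, below is a minimal PyTorch sketch of belief-conditioned soft attention over visual tokens. It is an illustration under stated assumptions, not the paper's implementation: the GRUCell belief approximator, the dot-product scoring, the mean-pooled belief update, and all dimensions are choices made here for clarity.

```python
import torch
import torch.nn as nn

class ActiveVisualAttention(nn.Module):
    """Belief-conditioned soft attention over visual tokens (illustrative sketch)."""

    def __init__(self, token_dim: int, belief_dim: int):
        super().__init__()
        # Recurrent cell that approximates the agent's belief state.
        self.belief_rnn = nn.GRUCell(input_size=token_dim, hidden_size=belief_dim)
        # Projects the belief into token space so it can score each visual token.
        self.query_proj = nn.Linear(belief_dim, token_dim)

    def forward(self, visual_tokens: torch.Tensor, belief: torch.Tensor):
        # visual_tokens: (B, N, D) patch tokens from the vision backbone.
        # belief: (B, H) recurrent state carried over from the previous step.
        query = self.query_proj(belief)                                  # (B, D)
        scores = torch.einsum("bnd,bd->bn", visual_tokens, query)
        weights = torch.softmax(scores / visual_tokens.shape[-1] ** 0.5, dim=-1)
        # Soft modulation: history-relevant tokens are amplified, none hard-pruned.
        attended = visual_tokens * weights.unsqueeze(-1)                 # (B, N, D)
        # Fold the current observation back into the belief for the next step.
        new_belief = self.belief_rnn(attended.mean(dim=1), belief)       # (B, H)
        return attended, new_belief

# Example roll-out: 196 ViT patch tokens of width 768, belief width 512 (both assumed).
ava = ActiveVisualAttention(token_dim=768, belief_dim=512)
belief = torch.zeros(2, 512)
for _ in range(3):  # three decision steps
    tokens = torch.randn(2, 196, 768)
    tokens, belief = ava(tokens, belief)
```

The point the sketch captures is that `belief` is rolled forward across decision steps, so the weighting of the current frame's tokens depends on what the agent has already seen.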

📝 Abstract
Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in embodied AI tasks. However, existing VLA models, often built upon Vision-Language Models (VLMs), typically process dense visual inputs independently at each timestep, implicitly modeling the task as a Markov Decision Process (MDP). This history-agnostic design is suboptimal for effective visual token processing in dynamic sequential decision-making, as it fails to leverage historical context. To address this limitation, we reformulate the problem from a Partially Observable Markov Decision Process (POMDP) perspective and propose a novel framework named AVA-VLA. Inspired by the POMDP insight that action generation should be conditioned on the belief state, AVA-VLA introduces Active Visual Attention (AVA) to dynamically modulate visual processing. It achieves this by leveraging the recurrent state, a neural approximation of the agent's belief state carried over from the previous decision step. Specifically, the AVA module uses the recurrent state to compute soft weights that actively process task-relevant visual tokens based on historical context. Comprehensive evaluations demonstrate that AVA-VLA achieves state-of-the-art performance across popular robotic benchmarks, including LIBERO and CALVIN. Furthermore, real-world deployment on a dual-arm robot platform validates the framework's practical applicability and robust sim-to-real transferability.
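As background for the reformulation the abstract describes, the exact POMDP belief update that action generation would ideally condition on is the standard Bayes filter (notation chosen here, not taken from the paper):

```latex
% Bayes-filter belief update over hidden states s', given the new
% observation o_t and the previous action a_{t-1}:
b_t(s') \;\propto\; O\!\left(o_t \mid s', a_{t-1}\right) \sum_{s} T\!\left(s' \mid s, a_{t-1}\right) b_{t-1}(s)
```

Maintaining this update exactly is intractable for high-dimensional visual observations, which is why AVA-VLA approximates the belief with a learned recurrent state.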
Problem

Research questions and friction points this paper is trying to address.

Improving visual token processing in dynamic sequential decision-making
Addressing history-agnostic limitations in Vision-Language-Action models
Enhancing visual attention through belief state approximation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Active Visual Attention modulates visual processing dynamically
Recurrent state approximates belief state for context
Soft weights prioritize task-relevant visual tokens
Authors

Lei Xiao
LiAuto Inc.

Jifeng Li
LiAuto Inc.

Juntao Gao
Nara Institute of Science and Technology
Stochastic Network Optimization, Machine Learning, Intelligent Transportation Systems

Feiyang Ye
University of Technology Sydney, Ph.D. student
Multi-Task Learning

Yan Jin
LiAuto Inc.

Jingjing Qian
School of Data Science, The Chinese University of Hong Kong, Shenzhen

Jing Zhang
School of Information Science and Technology, Beijing University of Technology

Yong Wu
LiAuto Inc.

Xiaoyuan Yu
LiAuto Inc.