🤖 AI Summary
Current Vision-Language-Action (VLA) models process temporal visual inputs independently, implicitly assuming Markovian dynamics and neglecting historical context, which leads to suboptimal utilization of visual tokens in dynamic sequential decision-making. To address this, we reformulate VLA modeling from a Partially Observable Markov Decision Process (POMDP) perspective and propose an active visual attention mechanism: a recurrent network approximates the belief state, and visual token selection is dynamically modulated by this history-conditioned belief, enabling context-aware soft attention. Our method integrates a vision-language backbone with recurrent state modeling. It achieves state-of-the-art performance on the LIBERO and CALVIN benchmarks and demonstrates strong sim-to-real transfer on a dual-arm robotic platform. The core contribution lies in embedding belief-driven temporal awareness into the VLA architecture, substantially enhancing long-horizon visual reasoning for embodied agents.
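For readers less familiar with the POMDP framing, the quantities involved can be written as follows (standard definitions; the symbols are conventional notation, not taken verbatim from the paper):

$$
b_t(s) = P\big(s_t = s \mid o_{1:t},\, a_{1:t-1}\big), \qquad
h_t = f_\theta(h_{t-1},\, o_t,\, a_{t-1}), \qquad
a_t \sim \pi_\theta(\cdot \mid h_t,\, o_t)
$$

The exact belief state $b_t$ is intractable to maintain in general, so a recurrent network produces $h_t$ as its neural approximation; the visual attention weights at step $t$ are then conditioned on the belief carried over from the previous decision step.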
📄 Abstract
Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in embodied AI tasks. However, existing VLA models, often built upon Vision-Language Models (VLMs), typically process dense visual inputs independently at each timestep, implicitly modeling the task as a Markov Decision Process (MDP). This history-agnostic design is suboptimal for effective visual token processing in dynamic sequential decision-making, as it fails to leverage historical context. To address this limitation, we reformulate the problem from a Partially Observable Markov Decision Process (POMDP) perspective and propose a novel framework named AVA-VLA. Motivated by the POMDP principle that action generation should be conditioned on the belief state, AVA-VLA introduces Active Visual Attention (AVA) to dynamically modulate visual processing. It achieves this by leveraging the recurrent state, a neural approximation of the agent's belief state carried over from the previous decision step. Specifically, the AVA module uses the recurrent state to compute soft weights that actively emphasize task-relevant visual tokens based on historical context. Comprehensive evaluations demonstrate that AVA-VLA achieves state-of-the-art performance across popular robotic benchmarks, including LIBERO and CALVIN. Furthermore, real-world deployments on a dual-arm robot platform validate the framework's practical applicability and robust sim-to-real transferability.
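To make the mechanism concrete, below is a minimal PyTorch sketch of belief-conditioned soft token weighting. All names, dimensions, and design choices here (a GRU cell for the recurrent state, a per-token sigmoid gate, mean pooling for the state update) are our illustrative assumptions, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class ActiveVisualAttention(nn.Module):
    """Hypothetical sketch: re-weight visual tokens with soft weights
    computed from a recurrent (belief-like) state."""

    def __init__(self, token_dim: int, state_dim: int):
        super().__init__()
        # GRU cell maintains the recurrent state h_t, a neural stand-in
        # for the POMDP belief state (one design choice among several).
        self.rnn = nn.GRUCell(input_size=token_dim, hidden_size=state_dim)
        # Scores each visual token against the current recurrent state.
        self.score = nn.Sequential(
            nn.Linear(token_dim + state_dim, state_dim),
            nn.Tanh(),
            nn.Linear(state_dim, 1),
        )

    def forward(self, visual_tokens: torch.Tensor, h_prev: torch.Tensor):
        # visual_tokens: (B, N, token_dim); h_prev: (B, state_dim)
        B, N, _ = visual_tokens.shape
        h_expanded = h_prev.unsqueeze(1).expand(B, N, h_prev.shape[-1])
        # Soft weights in [0, 1] per token, conditioned on history.
        logits = self.score(torch.cat([visual_tokens, h_expanded], dim=-1))
        weights = torch.sigmoid(logits)            # (B, N, 1)
        weighted_tokens = visual_tokens * weights  # modulate, don't prune
        # Update the recurrent state with a pooled summary of this view.
        pooled = weighted_tokens.mean(dim=1)       # (B, token_dim)
        h_next = self.rnn(pooled, h_prev)
        return weighted_tokens, h_next
```

In such a design, the weighted tokens would be fed to the VLM backbone and action head in place of the raw visual tokens; because the weights are soft rather than hard, the module remains differentiable end to end.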