🤖 AI Summary
Current Vision-Language-Action (VLA) models process temporal visual inputs independently, implicitly assuming Markovian dynamics and neglecting historical context, which leads to suboptimal utilization of visual tokens in dynamic sequential decision-making. To address this, we reformulate VLA modeling from a Partially Observable Markov Decision Process (POMDP) perspective and propose an active visual attention mechanism: a recurrent network approximates the belief state, and visual token selection is dynamically modulated by this history-conditioned belief, enabling context-aware soft attention. Our method integrates a vision-language backbone with recurrent state modeling. It achieves state-of-the-art performance on the LIBERO and CALVIN benchmarks and demonstrates strong sim-to-real transfer on a dual-arm robotic platform. The core contribution lies in embedding belief-driven temporal awareness into the VLA architecture, substantially enhancing long-horizon visual reasoning for embodied agents.
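For readers less familiar with the POMDP framing, the quantities involved can be written as follows (standard definitions; the symbols are conventional notation, not taken verbatim from the paper):

$$
b_t(s) = P\big(s_t = s \mid o_{1:t},\, a_{1:t-1}\big), \qquad
h_t = f_\theta(h_{t-1},\, o_t,\, a_{t-1}), \qquad
a_t \sim \pi_\theta(\cdot \mid h_t,\, o_t)
$$

The exact belief state $b_t$ is intractable to maintain in general, so a recurrent network produces $h_t$ as its neural approximation; the visual attention weights at step $t$ are then conditioned on the belief carried over from the previous decision step.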
📄 Abstract
Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in embodied AI tasks. However, existing VLA models, often built upon Vision-Language Models (VLMs), typically process dense visual inputs independently at each timestep, implicitly modeling the task as a Markov Decision Process (MDP). This history-agnostic design is suboptimal for effective visual token processing in dynamic sequential decision-making, as it fails to leverage historical context. To address this limitation, we reformulate the problem from a Partially Observable Markov Decision Process (POMDP) perspective and propose a novel framework named AVA-VLA. Motivated by the POMDP principle that action generation should be conditioned on the belief state, AVA-VLA introduces Active Visual Attention (AVA) to dynamically modulate visual processing. It achieves this by leveraging the recurrent state, a neural approximation of the agent's belief state carried over from the previous decision step. Specifically, the AVA module uses the recurrent state to compute soft weights that actively emphasize task-relevant visual tokens based on historical context. Comprehensive evaluations demonstrate that AVA-VLA achieves state-of-the-art performance across popular robotic benchmarks, including LIBERO and CALVIN. Furthermore, real-world deployments on a dual-arm robot platform validate the framework's practical applicability and robust sim-to-real transferability.
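To make the mechanism concrete, below is a minimal PyTorch sketch of belief-conditioned soft token weighting. All names, dimensions, and design choices here (a GRU cell for the recurrent state, a per-token sigmoid gate, mean pooling for the state update) are our illustrative assumptions, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class ActiveVisualAttention(nn.Module):
    """Hypothetical sketch: re-weight visual tokens with soft weights
    computed from a recurrent (belief-like) state."""

    def __init__(self, token_dim: int, state_dim: int):
        super().__init__()
        # GRU cell maintains the recurrent state h_t, a neural stand-in
        # for the POMDP belief state (one design choice among several).
        self.rnn = nn.GRUCell(input_size=token_dim, hidden_size=state_dim)
        # Scores each visual token against the current recurrent state.
        self.score = nn.Sequential(
            nn.Linear(token_dim + state_dim, state_dim),
            nn.Tanh(),
            nn.Linear(state_dim, 1),
        )

    def forward(self, visual_tokens: torch.Tensor, h_prev: torch.Tensor):
        # visual_tokens: (B, N, token_dim); h_prev: (B, state_dim)
        B, N, _ = visual_tokens.shape
        h_expanded = h_prev.unsqueeze(1).expand(B, N, h_prev.shape[-1])
        # Soft weights in [0, 1] per token, conditioned on history.
        logits = self.score(torch.cat([visual_tokens, h_expanded], dim=-1))
        weights = torch.sigmoid(logits)            # (B, N, 1)
        weighted_tokens = visual_tokens * weights  # modulate, don't prune
        # Update the recurrent state with a pooled summary of this view.
        pooled = weighted_tokens.mean(dim=1)       # (B, token_dim)
        h_next = self.rnn(pooled, h_prev)
        return weighted_tokens, h_next
```

In such a design, the weighted tokens would be fed to the VLM backbone and action head in place of the raw visual tokens; because the weights are soft rather than hard, the module remains differentiable end to end.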