🤖 AI Summary
This work addresses the limitations of existing vision-language-action (VLA) systems, which often fail to maintain object tracking and precise localization under occlusion due to weak spatial awareness and insufficient temporal memory. To overcome these challenges, we introduce a visual anchoring mechanism into the VLA framework for the first time, preserving initial scene context through anchor images. A lightweight spatial encoder is designed to jointly process both anchor and current frames, enhancing the model’s capacity to reason about geometric relationships and temporal context—without requiring additional sensing modalities such as depth maps or point clouds. Built upon the Qwen2.5-VL backbone with a diffusion-based action head, our approach achieves a 13.6% performance gain on the Simpler WidowX benchmark and demonstrates an average success rate of 80% on real-world robotic tasks.
📝 Abstract
Since current Vision-Language-Action (VLA) systems suffer from limited spatial perception and the absence of memory throughout manipulation, we investigate visual anchors as a means to enhance spatial and temporal reasoning within VLA policies for robotic manipulation. Conventional VLAs generate actions by conditioning on a single current frame together with a language instruction. However, since the frame is encoded as a 2D image, it does not contain detailed spatial information, and the VLA similarly lacks any means to incorporate past context. As a result, it frequently forgets objects under occlusion and becomes spatially disoriented during the manipulation process. Thus, we propose AnchorVLA4D, a simple spatial-temporal VLA that augments the visual input with an anchor image to preserve the initial scene context throughout execution, and adds a lightweight spatial encoder that jointly processes the anchor and current frames to expose geometric relationships within an episode. Built on a Qwen2.5-VL backbone with a diffusion-based action head, AnchorVLA4D requires no additional sensing modalities (e.g., depth or point clouds) and introduces negligible inference overhead. Combining anchoring with a frozen pretrained spatial encoder yields further gains, realizing a 13.6% improvement on the Simpler WidowX benchmark and confirming the approach on real-world tasks, where it achieved an average success rate of 80%.