AnchorVLA4D: An Anchor-Based Spatial-Temporal Vision-Language-Action Model for Robotic Manipulation

📅 2026-03-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of existing vision-language-action (VLA) systems, which often fail to maintain object tracking and precise localization under occlusion due to weak spatial awareness and insufficient temporal memory. To overcome these challenges, we introduce a visual anchoring mechanism into the VLA framework for the first time, preserving initial scene context through anchor images. A lightweight spatial encoder is designed to jointly process both anchor and current frames, enhancing the model’s capacity to reason about geometric relationships and temporal context—without requiring additional sensing modalities such as depth maps or point clouds. Built upon the Qwen2.5-VL backbone with a diffusion-based action head, our approach achieves a 13.6% performance gain on the Simpler WidowX benchmark and demonstrates an average success rate of 80% on real-world robotic tasks.

📝 Abstract
Since current Vision-Language-Action (VLA) systems suffer from limited spatial perception and the absence of memory throughout manipulation, we investigate visual anchors as a means to enhance spatial and temporal reasoning within VLA policies for robotic manipulation. Conventional VLAs generate actions by conditioning on a single current frame together with a language instruction. However, since the frame is encoded as a 2D image, it does not contain detailed spatial information, and the VLA likewise has no means to incorporate past context. As a result, it frequently forgets objects under occlusion and becomes spatially disoriented during manipulation. We therefore propose AnchorVLA4D, a simple spatial-temporal VLA that augments the visual input with an anchor image to preserve the initial scene context throughout execution, and adds a lightweight spatial encoder that jointly processes the anchor and current frames to expose geometric relationships within an episode. Built on a Qwen2.5-VL backbone with a diffusion-based action head, AnchorVLA4D requires no additional sensing modalities (e.g., depth or point clouds) and introduces negligible inference overhead. Combining anchoring with a frozen pretrained spatial encoder yields further gains, realizing a 13.6% improvement on the Simpler WidowX benchmark; the approach is also confirmed on real-world tasks, where it achieves an average success rate of 80%.
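The core anchoring idea — conditioning the policy on the episode's first frame alongside the current one, with a lightweight encoder comparing the two — can be illustrated with a toy NumPy sketch. All shapes, weights, and function names below are illustrative assumptions, not the paper's actual architecture (which uses Qwen2.5-VL visual tokens and a diffusion action head).

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_frame(frame, w_proj):
    # Flatten the frame and project it to a feature vector
    # (a stand-in for the backbone's visual tokens).
    return frame.reshape(-1) @ w_proj

def spatial_encoder(anchor_feat, current_feat, w_s):
    # Jointly process anchor and current features so the policy can
    # relate the initial scene layout to the present observation.
    joint = np.concatenate([anchor_feat, current_feat])
    return np.tanh(joint @ w_s)

# Toy dimensions: 8x8 RGB frames, 16-dim features (illustrative only).
H, W, C, D = 8, 8, 3, 16
w_proj = rng.standard_normal((H * W * C, D)) * 0.01
w_s = rng.standard_normal((2 * D, D)) * 0.1

anchor = rng.random((H, W, C))    # first frame, kept fixed all episode
current = rng.random((H, W, C))   # latest observation

a_feat = encode_frame(anchor, w_proj)
c_feat = encode_frame(current, w_proj)
spatial_tokens = spatial_encoder(a_feat, c_feat, w_s)
print(spatial_tokens.shape)  # (16,)
```

The key design point this sketch mirrors is that the anchor features are computed once from the initial frame and reused at every step, so occluded objects remain represented even when absent from the current view.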
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
spatial perception
temporal memory
robotic manipulation
occlusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual anchors
spatial-temporal reasoning
Vision-Language-Action (VLA)
geometric relationships
robotic manipulation
Juan Zhu
PrimeBot
Zhanying Shao
School of Computer Science, Peking University
Xiaoqi Li
Peking University
Robotics · Computer Vision
Ethan Morgan
PrimeBot
Jiadong Xu
School of Computer Science, Peking University
Hongwei Fan
Peking University
Robotics · 3D Vision
Hao Dong
Tenured Associate Professor at Peking University
Embodied AI · Robotics · 3D Vision · Robot Learning · Reinforcement Learning