VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models

📅 2026-05-11
🤖 AI Summary
Existing vision-language-action models have limited spatial perception because their visual backbones are pretrained on 2D images without 3D geometric supervision. This work proposes VEGA, a framework that attaches a lightweight projector to the output of the visual encoder and aligns the projected features with those of DINOv2-FiT3D, a DINOv2 model fine-tuned under multi-view consistent 3D Gaussian Splatting supervision, so that spatial alignment happens before visual features become entangled with language semantics. Because the projector is used only during training and discarded at inference, VEGA adds no inference cost while offering a more interpretable alignment target. Experiments on simulated and real-world manipulation tasks show that VEGA outperforms existing implicit spatial grounding baselines, setting a new state of the art among implicit spatial grounding methods for VLA models.
📝 Abstract
Precise spatial reasoning is fundamental to robotic manipulation, yet the visual backbones of current vision-language-action (VLA) models are predominantly pretrained on 2D image data without explicit 3D geometric supervision, resulting in representations that lack accurate spatial awareness. Existing implicit spatial grounding methods partially address this by aligning VLA features with those of 3D-aware foundation models, but they rely on empirical layer search and perform alignment on LLM-level visual tokens where spatial structure has already been entangled with linguistic semantics, limiting both generalizability and geometric interpretability. We propose VEGA (Visual Encoder Grounding Alignment), a simple yet effective framework that directly aligns the output of the VLA's visual encoder with spatially-aware features from DINOv2-FiT3D, a DINOv2 model fine-tuned with multi-view consistent 3D Gaussian Splatting supervision. By performing alignment at the visual encoder output level, VEGA grounds spatial awareness before any linguistic entanglement occurs, offering a more interpretable and principled alignment target. The alignment is implemented via a lightweight projector trained with a cosine similarity loss alongside the standard action prediction objective, and the projector is discarded at inference time, introducing no additional computational overhead. Extensive experiments on simulation benchmarks and real-world manipulation tasks demonstrate that VEGA consistently outperforms existing implicit spatial grounding baselines, establishing a new state of the art among implicit spatial grounding methods for VLA models.
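The training objective described in the abstract can be illustrated with a minimal sketch. It assumes a single linear layer as the projector, a mean over patch tokens of (1 - cosine similarity) as the alignment term, and a scalar weight `align_weight` that combines it with the action loss. The abstract only states that a lightweight projector is trained with a cosine similarity loss alongside the action prediction objective, so the projector architecture, the exact loss form, the weighting, and every function name below are illustrative assumptions rather than the authors' implementation.

```python
import math


def linear_projector(weights, bias, token):
    """Hypothetical lightweight projector: one linear layer.

    weights is an (out_dim x in_dim) nested list, token has length in_dim.
    """
    return [sum(w * x for w, x in zip(row, token)) + b
            for row, b in zip(weights, bias)]


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


def alignment_loss(encoder_tokens, teacher_tokens, weights, bias):
    """Assumed form: mean over patch tokens of (1 - cosine similarity).

    encoder_tokens: visual-encoder outputs of the VLA, one vector per patch.
    teacher_tokens: features from the 3D-aware teacher for the same patches.
    """
    sims = [
        cosine_similarity(linear_projector(weights, bias, z), t)
        for z, t in zip(encoder_tokens, teacher_tokens)
    ]
    return 1.0 - sum(sims) / len(sims)


def total_loss(action_loss, align_loss, align_weight=0.5):
    """Assumed combination: action objective plus weighted alignment term."""
    return action_loss + align_weight * align_loss


if __name__ == "__main__":
    # Toy example: 2 patch tokens, 2-dim features, identity projector.
    W = [[1.0, 0.0], [0.0, 1.0]]
    b = [0.0, 0.0]
    student = [[1.0, 2.0], [3.0, -1.0]]
    teacher = [[2.0, 4.0], [3.0, -1.0]]  # same directions as the student tokens
    loss = alignment_loss(student, teacher, W, b)
    print(total_loss(action_loss=1.0, align_loss=loss))
```

In this sketch the projector appears only inside `alignment_loss`, so dropping the alignment term at inference also drops the projector, which is consistent with the claim of no additional inference overhead.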
Problem

Research questions and friction points this paper is trying to address.

spatial reasoning
vision-language-action models
3D geometric supervision
spatial grounding
visual encoder
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial grounding
vision-language-action models
3D-aware alignment
visual encoder
DINOv2-FiT3D
Hao Wang
Peking University
AI4Science, Embodied AI, machine learning
Xiaobao Wei
Institute of Software, Chinese Academy of Sciences
3D Vision
Jingyang He
Peking University, Beijing, China; Beijing Innovation Center of Humanoid Robotics, Beijing, China
Chengyu Bai
Peking University, Beijing, China; Beijing Innovation Center of Humanoid Robotics, Beijing, China
Chun-Kai Fan
Peking University
Jiajun Cao
Ph.D. Student, Peking University
MLLM, Computer Vision
Jintao Chen
Peking University, Beijing, China; Beijing Innovation Center of Humanoid Robotics, Beijing, China
Ying Li
Peking University, Beijing, China
Shanyu Rong
Peking University, Beijing, China
Ming Lu
Peking University, Beijing, China
Xiaozhu Ju
Peking University, Beijing, China; Beijing Innovation Center of Humanoid Robotics, Beijing, China
Jian Tang
Peking University, Beijing, China; Beijing Innovation Center of Humanoid Robotics, Beijing, China
Shanghang Zhang
Peking University
Embodied AI, Foundation Models