VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language-action models have limited spatial perception because their visual backbones are pretrained on 2D images without 3D geometric supervision. This work proposes VEGA, a framework that attaches a lightweight projector to the output of the visual encoder and explicitly aligns the projected features with those of DINOv2-FiT3D, a DINOv2 model fine-tuned under multi-view consistent 3D Gaussian Splatting supervision. It is presented as the first approach to perform this spatial alignment before visual features become entangled with language semantics. By decoupling spatial grounding from semantic interference, VEGA improves geometric interpretability and generalization, and because the projector is discarded at inference time, it adds no inference cost. Experiments show that VEGA outperforms existing implicit spatial grounding approaches on both simulated and real-world manipulation tasks, setting a new state of the art among such methods.

📝 Abstract

Precise spatial reasoning is fundamental to robotic manipulation, yet the visual backbones of current vision-language-action (VLA) models are predominantly pretrained on 2D image data without explicit 3D geometric supervision, resulting in representations that lack accurate spatial awareness. Existing implicit spatial grounding methods partially address this by aligning VLA features with those of 3D-aware foundation models, but they rely on empirical layer search and perform alignment on LLM-level visual tokens where spatial structure has already been entangled with linguistic semantics, limiting both generalizability and geometric interpretability. We propose VEGA (Visual Encoder Grounding Alignment), a simple yet effective framework that directly aligns the output of the VLA's visual encoder with spatially-aware features from DINOv2-FiT3D, a DINOv2 model fine-tuned with multi-view consistent 3D Gaussian Splatting supervision. By performing alignment at the visual encoder output level, VEGA grounds spatial awareness before any linguistic entanglement occurs, offering a more interpretable and principled alignment target. The alignment is implemented via a lightweight projector trained with a cosine similarity loss alongside the standard action prediction objective, and the projector is discarded at inference time, introducing no additional computational overhead. Extensive experiments in simulation and on real-world manipulation tasks demonstrate that VEGA consistently outperforms existing implicit spatial grounding baselines, establishing a new state of the art among implicit spatial grounding methods for VLA models.
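To make the training objective concrete, the following is a minimal sketch of a cosine-similarity alignment term added to an action loss. It is not the authors' implementation. The abstract does not give the exact formulation, so several things here are assumptions: the `1 - mean cosine` form of the loss, the one-to-one pairing of student and teacher patch tokens, the linear projector, and the weighting factor `align_weight`. All function names are hypothetical, and plain Python lists stand in for tensors so the example has no dependencies.

```python
import math

def project(token, weight, bias):
    """Hypothetical linear projector: maps one visual-encoder token
    into the teacher's feature dimension (weight is out_dim x in_dim)."""
    return [sum(w * x for w, x in zip(row, token)) + b
            for row, b in zip(weight, bias)]

def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two vectors, guarded against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)

def alignment_loss(student_tokens, teacher_tokens, weight, bias):
    """Assumed form: 1 minus the mean per-token cosine similarity between
    projected student tokens and (frozen) teacher tokens."""
    sims = [cosine_similarity(project(s, weight, bias), t)
            for s, t in zip(student_tokens, teacher_tokens)]
    return 1.0 - sum(sims) / len(sims)

def total_loss(action_loss, align_loss, align_weight=0.5):
    """Assumed combination: action objective plus a weighted alignment term.
    The value of align_weight is illustrative, not taken from the paper."""
    return action_loss + align_weight * align_loss

# Toy usage: two 2-D student tokens projected to the teacher's 3-D space.
weight = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
bias = [0.0, 0.0, 0.0]
student = [[1.0, 0.0], [0.0, 2.0]]
teacher = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
loss = total_loss(action_loss=2.0,
                  align_loss=alignment_loss(student, teacher, weight, bias))
```

Because cosine similarity ignores vector magnitude, the alignment term in this toy case is zero even though the student and teacher tokens differ in scale. The projector parameters (`weight`, `bias`) are used only to compute this training loss, which is consistent with the abstract's statement that the projector is dropped at inference time.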

Problem

Research questions and friction points this paper is trying to address.

spatial reasoning

vision-language-action models

3D geometric supervision

spatial grounding

visual encoder

Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial grounding

vision-language-action models

3D-aware alignment