ROCKET: Residual-Oriented Multi-Layer Alignment for Spatially-Aware Vision-Language-Action Models

📅 2026-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited 3D spatial understanding in current vision–language–action (VLA) models and the gradient interference caused by conventional multi-layer alignment strategies, which hinder effective exploitation of depth information. To overcome these challenges, the authors propose a residual-guided multi-layer alignment framework that establishes layer-invariant mappings between VLA models and 3D vision foundation models via a shared projector. The approach incorporates residual flow alignment, Matryoshka sparse activation, and a training-free layer selection strategy to substantially reduce computational overhead and mitigate gradient conflicts. Evaluated on the LIBERO benchmark, the method achieves 98.5% of the state-of-the-art success rate using only 4% of the computational budget, and demonstrates strong generalization across LIBERO-Plus, RoboTwin, and diverse VLA architectures.

📝 Abstract
Vision-Language-Action (VLA) models enable instruction-following robotic manipulation, but they are typically pretrained on 2D data and lack 3D spatial understanding. An effective remedy is representation alignment, in which a strong vision foundation model guides a 2D VLA model. However, existing methods usually apply supervision at only a single layer, failing to exploit the rich information distributed across network depth; meanwhile, naïve multi-layer alignment can cause gradient interference. We introduce ROCKET, a residual-oriented multi-layer representation alignment framework that formulates multi-layer alignment as aligning one residual stream to another. Concretely, ROCKET employs a shared projector to align multiple layers of the VLA backbone with multiple layers of a powerful 3D vision foundation model via a layer-invariant mapping, which reduces gradient conflicts. We provide both theoretical justification and empirical analyses showing that a shared projector is sufficient and outperforms prior designs, and further propose a Matryoshka-style sparse activation scheme for the shared projector to balance the multiple alignment losses. Our experiments show that, combined with a training-free layer selection strategy, ROCKET requires only about 4% of the compute budget while achieving 98.5% of the state-of-the-art success rate on LIBERO. We further demonstrate the superior performance of ROCKET on LIBERO-Plus and RoboTwin, as well as across multiple VLA models. The code and model weights can be found at https://github.com/CASE-Lab-UMD/ROCKET-VLA.
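The core idea in the abstract (one shared, layer-invariant projector aligning several VLA layers to several 3D foundation-model layers, with Matryoshka-style nested activation of the projector's output dimensions) can be illustrated with a rough numpy sketch. Everything below is assumed for illustration only: the feature dimensions, the random stand-in features, the single linear map standing in for the projector, and the power-of-two prefix schedule are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_vla, d_3d, n_layers, n_tokens = 32, 16, 4, 8

# Stand-in features: per-layer token embeddings from the VLA backbone and
# from a 3D vision foundation model (random placeholders, not real models).
vla_feats = [rng.standard_normal((n_tokens, d_vla)) for _ in range(n_layers)]
target_feats = [rng.standard_normal((n_tokens, d_3d)) for _ in range(n_layers)]

# One shared projector reused for every layer (the layer-invariant mapping),
# reduced here to a single linear map.
W_shared = rng.standard_normal((d_vla, d_3d)) * 0.1

def cosine_alignment_loss(pred, target):
    """1 minus the mean cosine similarity between matched token embeddings."""
    pred_n = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    tgt_n = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return 1.0 - float(np.mean(np.sum(pred_n * tgt_n, axis=-1)))

# Matryoshka-style sparse activation: each layer uses a nested prefix of the
# projector's output dimensions, growing with depth (an assumed schedule).
losses = []
for i in range(n_layers):
    k = d_3d // 2 ** (n_layers - 1 - i)  # prefix sizes 2, 4, 8, 16
    pred = vla_feats[i] @ W_shared       # same projector at every layer
    losses.append(cosine_alignment_loss(pred[:, :k], target_feats[i][:, :k]))

total_loss = float(np.mean(losses))      # one objective balancing all layers
```

Because the same `W_shared` receives gradients from every layer's loss, the per-layer objectives are coupled through a single parameter set rather than through independent per-layer projectors, which is the mechanism the abstract credits with reducing gradient conflicts.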
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models
3D spatial understanding
representation alignment
multi-layer alignment
gradient interference
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-layer alignment
residual-oriented
shared projector
spatial-aware VLA
gradient interference mitigation