From Noise to Intent: Anchoring Generative VLA Policies with Residual Bridges

📅 2026-04-23
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the inefficient representation and weak conditional alignment that arise in generative vision-language-action (VLA) policies because high-level semantics and low-level control operate on mismatched spatiotemporal scales. The authors propose ResVLA, an architecture that replaces the “Generation-from-Noise” paradigm with “Refinement-from-Intent.” Using spectral analysis, ResVLA decouples control into a deterministic low-frequency intent anchor and a stochastic high-frequency residual, and a residual diffusion bridge refines only the local dynamics around the predicted anchor. In simulation, ResVLA achieves competitive performance, converges faster than standard generative baselines, and is robust to language and robot embodiment perturbations; it also performs strongly in real-world robot experiments.

📝 Abstract
Bridging high-level semantic understanding with low-level physical control remains a persistent challenge in embodied intelligence, stemming from the fundamental spatiotemporal scale mismatch between cognition and action. Existing generative VLA policies typically adopt a "Generation-from-Noise" paradigm, which disregards this disparity, leading to representation inefficiency and weak condition alignment during optimization. In this work, we propose ResVLA, an architecture that shifts the paradigm to "Refinement-from-Intent." Recognizing that robotic motion naturally decomposes into global intent and local dynamics, ResVLA utilizes spectral analysis to decouple control into a deterministic low-frequency anchor and a stochastic high-frequency residual. By anchoring the generative process on the predicted intent, our model focuses strictly on refining local dynamics via a residual diffusion bridge. Extensive simulation experiments show that ResVLA achieves competitive performance, strong robustness to language and robot embodiment perturbations, and faster convergence than standard generative baselines. It also demonstrates strong performance in real-world robot experiments.
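
The spectral decoupling described in the abstract can be illustrated with a minimal sketch. The abstract does not say which transform or cutoff ResVLA uses, so everything here is an assumption made for illustration: a real FFT along the time axis of an action chunk, a hard low-pass cutoff, and the hypothetical names `split_anchor_residual` and `keep_bins`. The low-frequency part stands in for the deterministic intent anchor, and what is left over stands in for the high-frequency residual that a generative model would refine.

```python
import numpy as np


def split_anchor_residual(actions, keep_bins=3):
    """Split an action chunk of shape (T, D) into a low-frequency anchor
    and a high-frequency residual. Hypothetical helper, not the paper's code."""
    actions = np.asarray(actions, dtype=float)
    horizon = actions.shape[0]
    # Real FFT along time; assumes the chunk is treated as periodic.
    spectrum = np.fft.rfft(actions, axis=0)
    low = spectrum.copy()
    low[keep_bins:] = 0.0  # keep only the lowest `keep_bins` frequency bins
    anchor = np.fft.irfft(low, n=horizon, axis=0)
    residual = actions - anchor  # everything the anchor does not explain
    return anchor, residual


# Toy 2-DoF action chunk: one slow sweep per dimension plus small jitter.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 32, endpoint=False)
slow = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=1)
chunk = slow + 0.05 * rng.standard_normal(slow.shape)

anchor, residual = split_anchor_residual(chunk, keep_bins=3)
print("anchor energy:  ", float(np.sum(anchor ** 2)))
print("residual energy:", float(np.sum(residual ** 2)))
```

By construction the two parts sum back to the original chunk, and in this toy signal the residual carries only a small fraction of the energy. That is the intuition behind refining a residual around a predicted anchor instead of generating the full action sequence from noise.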
Problem

Research questions and friction points this paper is trying to address.

embodied intelligence
semantic understanding
physical control
spatiotemporal scale mismatch
generative VLA policies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Residual Diffusion Bridge
Intent Anchoring
Spectral Decoupling
Generative VLA
Refinement-from-Intent