AnchorVLA: Anchored Diffusion for Efficient End-to-End Mobile Manipulation

📅 2026-04-01
🤖 AI Summary
This work addresses the challenge of balancing action diversity, environmental reactivity, and execution stability in mobile manipulation. The authors propose AnchorVLA, a vision-language-action policy based on anchor-point diffusion that efficiently generates multimodal actions in closed-loop control. By performing localized denoising near plausible action manifolds and integrating a lightweight VLA backbone, an anchored diffusion action head, a truncated scheduling strategy, and a high-frequency residual self-correction module, AnchorVLA significantly improves task success rates and robustness under perturbations and distribution shifts. The method maintains low-latency inference and effectively mitigates error accumulation during chunked execution, achieving stable and responsive manipulation without sacrificing computational efficiency.
📝 Abstract
A central challenge in mobile manipulation is preserving multiple plausible action modes while remaining reactive during execution. A bottle in a cluttered scene can often be approached and grasped in multiple valid ways. Robust behavior depends on preserving this action diversity while remaining reactive as the scene evolves. Diffusion policies are appealing because they model multimodal action distributions rather than collapsing to one solution. But in practice, full iterative denoising is costly at control time. Action chunking helps amortize inference, yet it also creates partially open-loop behavior, allowing small mismatches to accumulate into drift. We present AnchorVLA, a diffusion-based VLA policy for mobile manipulation built on the core insight that when sampling begins near a plausible solution manifold, extensive denoising is unnecessary to recover multimodal, valid actions. AnchorVLA combines a lightweight VLA adaptation backbone with an anchored diffusion action head, which denoises locally around anchor trajectories using a truncated diffusion schedule. This retains multimodal action generation while reducing inference cost for closed-loop control. Crucially, to mitigate chunking-induced drift, we introduce a test-time self-correction mechanism via a lightweight residual correction module that makes high-frequency, per-step adjustments during rollout. Across diverse mobile manipulation tasks, AnchorVLA improves success and stability under disturbances and distribution shifts while maintaining low-latency inference. The source code is made available at https://github.com/jason-lim26/AnchorVLA.
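The two mechanisms the abstract describes — truncated denoising that starts near an anchor trajectory rather than from pure noise, and a per-step residual correction applied during rollout — can be illustrated with a minimal toy sketch. All names here (`anchored_denoise`, `residual_correct`, the Euler update, the scalar `gain`) are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def anchored_denoise(anchor, score_fn, t_start=0.3, steps=5, rng=None):
    """Toy truncated-schedule sampler: instead of starting from pure
    Gaussian noise at t = 1, perturb a plausible anchor trajectory with
    noise at an intermediate level t_start and run only a few denoising
    steps from there (localized denoising near the solution manifold)."""
    rng = np.random.default_rng() if rng is None else rng
    x = anchor + t_start * rng.standard_normal(anchor.shape)
    ts = np.linspace(t_start, 0.0, steps + 1)
    for t_hi, t_lo in zip(ts[:-1], ts[1:]):
        # One Euler step following the (assumed) learned score/denoiser.
        x = x + (t_hi - t_lo) * score_fn(x, t_hi)
    return x

def residual_correct(action, observation, gain=0.5):
    """Toy per-step residual correction: nudge the commanded action
    toward the currently observed target. A hypothetical stand-in for
    the learned high-frequency residual module used during rollout."""
    return action + gain * (observation - action)
```

Because sampling starts at `t_start` rather than `t = 1`, far fewer denoising steps are needed, while the injected noise still allows the sampler to land on different valid modes near the anchor — the efficiency/diversity trade-off the abstract argues for.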
Problem

Research questions and friction points this paper is trying to address.

mobile manipulation
multimodal action
diffusion policy
real-time reactivity
action drift
Innovation

Methods, ideas, or system contributions that make the work stand out.

anchored diffusion
multimodal action generation
closed-loop control
residual correction
vision-language-action policy
Jia Syuen Lim
UQMM Lab, The University of Queensland, Brisbane, Australia
Zhizhen Zhang
UQMM Lab, The University of Queensland, Brisbane, Australia
Peter Bohm
Postdoctoral Research Fellow at the University of Queensland
Reinforcement learning, deep learning for robot control
Brendan Tidd
Robotics and Autonomous Systems Group, CSIRO, Brisbane, Australia
Zi Huang
PhD Candidate
Deep Learning
Yadan Luo
ARC DECRA and Senior Lecturer, University of Queensland
Generalization, 3D Vision, Autonomous Driving