AnchorRefine: Synergy-Manipulation Based on Trajectory Anchor and Residual Refinement for Vision-Language-Action Models

📅 2026-04-20
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses a key limitation of existing vision-language-action (VLA) models: they generate actions in a single unified space, so learning is dominated by large-scale motions and the subtle corrective signals that determine task success are obscured. To overcome this, the authors propose a hierarchical action modeling paradigm that decouples action generation into two stages: coarse trajectory anchor planning and residual refinement. The first stage establishes a skeletal motion structure, while the second improves execution precision through geometry- and contact-aware local adjustments. A decision-aware gripper refinement module is also introduced to capture the gripper's discrete nature and its sensitivity to boundary conditions. The framework integrates with either regression- or diffusion-based VLA backbones and supports multi-task VLA policy optimization. Experiments on LIBERO, CALVIN, and real-world robotic tasks show gains of up to 7.8% in simulation success rate and up to 18% in real-world success rate.
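
The two-stage split described above can be illustrated with a toy example. This is only a sketch under stated assumptions, not the authors' implementation: the summary does not say how anchors are constructed, so the helper `make_anchor`, the `stride` parameter, and the linear interpolation between kept waypoints are all hypothetical choices made for illustration. The point is simply that a trajectory can be written as a coarse anchor plus a small residual, and that the two sum back to the original.

```python
def make_anchor(traj, stride):
    """Hypothetical coarse anchor: keep every `stride`-th waypoint (and the
    last one), then linearly interpolate between the kept waypoints."""
    n = len(traj)
    if n == 0:
        return []
    keys = list(range(0, n, stride))
    if keys[-1] != n - 1:
        keys.append(n - 1)
    anchor = [float(x) for x in traj]  # kept waypoints stay unchanged
    for a, b in zip(keys, keys[1:]):
        for t in range(a + 1, b):
            w = (t - a) / (b - a)
            anchor[t] = (1 - w) * traj[a] + w * traj[b]
    return anchor


def split(traj, stride):
    """Factor a trajectory into (anchor, residual) with traj = anchor + residual."""
    anchor = make_anchor(traj, stride)
    residual = [x - a for x, a in zip(traj, anchor)]
    return anchor, residual


# Toy 1-D end-effector positions: steady transport plus small deviations.
trajectory = [0.0, 0.26, 0.5, 0.74, 1.0, 1.24, 1.5, 1.76, 2.0]
anchor, residual = split(trajectory, stride=4)
reconstructed = [a + r for a, r in zip(anchor, residual)]
```

In this toy case the anchor carries the large transport motion, while the residual holds only deviations of about 0.01, which is the kind of small signal a dedicated refinement stage would model on its own scale.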

📝 Abstract
Precision-critical manipulation requires both global trajectory organization and local execution correction, yet most vision-language-action (VLA) policies generate actions within a single unified space. This monolithic formulation forces macro-level transport and micro-level refinement to be optimized under the same objective, causing large motions to dominate learning while suppressing small but failure-critical corrective signals. In contrast, human manipulation is structured by global movement planning together with continuous local adjustment during execution. Motivated by this principle, we propose AnchorRefine, a hierarchical framework that factorizes VLA action modeling into trajectory anchor and residual refinement. The anchor planner predicts a coarse motion scaffold, while the refinement module corrects execution-level deviations to improve geometric and contact precision. We further introduce a decision-aware gripper refinement mechanism to better capture the discrete and boundary-sensitive nature of gripper control. Experiments on LIBERO, CALVIN, and real-robot tasks demonstrate that AnchorRefine consistently improves both regression-based and diffusion-based VLA backbones, yielding gains of up to 7.8% in simulation success rate and 18% in real-world success rate.
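
The claim that large motions dominate a single shared objective can be checked with simple arithmetic. The sketch below assumes a plain mean-squared-error loss and invented magnitudes (0.10 m transport steps, 0.002 m corrections); neither the loss nor the numbers come from the paper. It compares a hypothetical policy that ignores the corrections entirely with one that reproduces them but is 5% off on the transport motion.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


# Invented magnitudes for illustration only (metres per step).
transport = [0.10] * 8                       # large macro-level motion
correction = [0.0] * 6 + [0.002, -0.002]     # small late-stage corrections
target = [m + c for m, c in zip(transport, correction)]

# Policy A: perfect transport, corrections dropped completely.
pred_a = list(transport)
# Policy B: corrections reproduced, but transport is 5% too large.
pred_b = [1.05 * m + c for m, c in zip(transport, correction)]

loss_a = mse(pred_a, target)   # only the two missed corrections contribute
loss_b = mse(pred_b, target)   # every step carries a 0.005 transport error

# On the residual alone, Policy A has lost 100% of the corrective signal.
residual_loss_a = mse([0.0] * 8, correction)
residual_energy = mse(correction, [0.0] * 8)
fraction_lost = residual_loss_a / residual_energy
```

Under these assumptions the unified loss ranks Policy A (about 1e-6) well ahead of Policy B (about 2.5e-5), even though A discards every correction. Scoring the residual separately exposes what the shared objective hides, which is the motivation for factorizing the action into anchor and refinement.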
Problem

Research questions and friction points this paper is trying to address.

vision-language-action models
trajectory planning
action refinement
precision manipulation
hierarchical action modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical action modeling
trajectory anchor
residual refinement
gripper refinement
vision-language-action