AnchorRefine: Synergy-Manipulation Based on Trajectory Anchor and Residual Refinement for Vision-Language-Action Models

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a key limitation of existing vision-language-action (VLA) models: they generate actions in a single unified space, so learning is dominated by large-scale motions and the subtle corrective signals that often decide task success are suppressed. To overcome this, the authors propose a hierarchical action modeling paradigm that decouples action generation into two stages: coarse trajectory anchor planning and residual refinement. The first stage establishes a skeletal motion structure, while the second improves execution precision through geometry- and contact-aware local adjustments. A decision-aware gripper refinement module is also introduced to capture the gripper’s discrete nature and its sensitivity to boundary conditions. The framework integrates with either regression-based or diffusion-based VLA backbones and supports multi-task VLA policy optimization. Experiments on LIBERO, CALVIN, and real-world robotic tasks show gains of up to 7.8% in simulation success rate and up to 18% in real-world success rate.

📝 Abstract

Precision-critical manipulation requires both global trajectory organization and local execution correction, yet most vision-language-action (VLA) policies generate actions within a single unified space. This monolithic formulation forces macro-level transport and micro-level refinement to be optimized under the same objective, causing large motions to dominate learning while suppressing small but failure-critical corrective signals. In contrast, human manipulation is structured by global movement planning together with continuous local adjustment during execution. Motivated by this principle, we propose AnchorRefine, a hierarchical framework that factorizes VLA action modeling into a trajectory anchor and a residual refinement. The anchor planner predicts a coarse motion scaffold, while the refinement module corrects execution-level deviations to improve geometric and contact precision. We further introduce a decision-aware gripper refinement mechanism to better capture the discrete and boundary-sensitive nature of gripper control. Experiments on LIBERO, CALVIN, and real-robot tasks demonstrate that AnchorRefine consistently improves both regression-based and diffusion-based VLA backbones, yielding gains of up to 7.8% in simulation success rate and 18% in real-world success rate.
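
The anchor-plus-residual factorization described in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation: the abstract does not say how anchors are constructed, so the sketch assumes, purely for illustration, that the anchor is a piecewise-linear interpolation through every `stride`-th step of an action chunk. The function names `split_anchor_residual` and `recompose`, the `stride` parameter, and the toy trajectory are all hypothetical.

```python
import numpy as np

def split_anchor_residual(actions, stride=4):
    """Split a (T, D) action chunk into a coarse anchor and a residual.

    Assumption (not from the paper): the anchor is a piecewise-linear
    interpolation through every `stride`-th step, plus the final step.
    """
    actions = np.asarray(actions, dtype=float)
    num_steps, num_dims = actions.shape
    knots = np.arange(0, num_steps, stride)
    if knots[-1] != num_steps - 1:
        knots = np.append(knots, num_steps - 1)  # always keep the endpoint
    steps = np.arange(num_steps)
    anchor = np.stack(
        [np.interp(steps, knots, actions[knots, d]) for d in range(num_dims)],
        axis=1,
    )
    residual = actions - anchor  # small, execution-level corrections
    return anchor, residual

def recompose(anchor, residual):
    """Final action = coarse anchor + residual refinement."""
    return anchor + residual

# Toy chunk: a large transport motion plus tiny corrective signals.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 16)
transport = np.stack([0.5 * t, 0.3 * t**2, -0.2 * t], axis=1)
correction = 0.002 * rng.standard_normal((16, 3))
actions = transport + correction

anchor, residual = split_anchor_residual(actions)
anchor_energy = float(np.mean(anchor**2))
residual_energy = float(np.mean(residual**2))
print(f"anchor energy:   {anchor_energy:.6f}")
print(f"residual energy: {residual_energy:.6f}")
```

In this toy setting the residual carries far less energy than the anchor, which is one way to see why a single squared-error objective over the full action would mostly reflect the large motion. Modeling the two parts separately lets each be normalized and supervised at its own scale.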

Problem

Research questions and friction points this paper is trying to address.

vision-language-action models

trajectory planning

action refinement

precision manipulation

hierarchical action modeling

Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical action modeling

trajectory anchor

residual refinement

gripper refinement

vision-language-action
