AI Summary
This work addresses a key limitation in existing vision-language-action (VLA) models, which generate actions in a unified space and are often dominated by large-scale motions, thereby obscuring subtle yet critical corrective signals essential for task success. To overcome this, the authors propose a hierarchical action modeling paradigm that decouples action generation into two stages: coarse trajectory anchor planning and residual refinement. The first stage establishes a skeletal motion structure, while the second enhances execution precision through geometrically and contact-aware local adjustments. Additionally, a decision-aware gripper refinement module is introduced to capture the gripper's discrete nature and sensitivity to boundary conditions. The framework flexibly integrates either regression- or diffusion-based foundation models and supports multi-task VLA policy optimization. Experiments demonstrate substantial performance gains across LIBERO, CALVIN, and real-world robotic tasks, with simulation success rates improving by up to 7.8% and real-world success rates increasing by as much as 18%.
Abstract
Precision-critical manipulation requires both global trajectory organization and local execution correction, yet most vision-language-action (VLA) policies generate actions within a single unified space. This monolithic formulation forces macro-level transport and micro-level refinement to be optimized under the same objective, causing large motions to dominate learning while suppressing small but failure-critical corrective signals. In contrast, human manipulation is structured by global movement planning together with continuous local adjustment during execution. Motivated by this principle, we propose AnchorRefine, a hierarchical framework that factorizes VLA action modeling into trajectory anchor and residual refinement. The anchor planner predicts a coarse motion scaffold, while the refinement module corrects execution-level deviations to improve geometric and contact precision. We further introduce a decision-aware gripper refinement mechanism to better capture the discrete and boundary-sensitive nature of gripper control. Experiments on LIBERO, CALVIN, and real-robot tasks demonstrate that AnchorRefine consistently improves both regression-based and diffusion-based VLA backbones, yielding gains of up to 7.8% in simulation success rate and 18% in real-world success rate.
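To make the anchor/residual factorization concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not say how anchors are constructed, so the sketch assumes a simple stand-in: the anchor keeps every `stride`-th waypoint of a 1-D action sequence and linearly interpolates between them, and the residual is whatever remains. The function names (`make_anchor`, `decompose`, `reconstruct`, `mean_square`) and the toy trajectory are all hypothetical.

```python
# Illustrative only. The anchor definition used here (subsample + linear
# interpolation) is an assumption, not the method described in the paper.

def make_anchor(actions, stride):
    """Coarse scaffold: keep every `stride`-th value, interpolate linearly between them."""
    n = len(actions)
    knots = list(range(0, n, stride))
    if knots[-1] != n - 1:
        knots.append(n - 1)  # always keep the final waypoint
    anchor = []
    for a, b in zip(knots, knots[1:]):
        for t in range(a, b):
            w = (t - a) / (b - a)
            anchor.append((1 - w) * actions[a] + w * actions[b])
    anchor.append(actions[-1])
    return anchor


def decompose(actions, stride=4):
    """Split an action sequence into a coarse anchor and a fine residual."""
    anchor = make_anchor(actions, stride)
    residual = [x - c for x, c in zip(actions, anchor)]
    return anchor, residual


def reconstruct(anchor, residual):
    """The executed action is the anchor plus its residual correction."""
    return [c + r for c, r in zip(anchor, residual)]


def mean_square(xs):
    return sum(x * x for x in xs) / len(xs)


if __name__ == "__main__":
    # Toy trajectory: a large transport motion (0 -> 1) plus small corrections.
    actions = [t / 8 + 0.01 * (-1) ** t for t in range(9)]
    anchor, residual = decompose(actions, stride=4)
    print("anchor energy:  ", mean_square(anchor))
    print("residual energy:", mean_square(residual))
```

In this toy case the residual carries far less energy than the anchor, which illustrates the abstract's point: under a single shared regression objective, errors in the large transport component would dominate the loss, whereas a separate residual target can be learned at its own scale.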