CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction

📅 2024-12-09
🏛️ arXiv.org
📈 Citations: 10
Influential: 2
🤖 AI Summary
In robot visuomotor policy learning, diffusion models achieve high accuracy but suffer from slow inference and inflexible constraint handling. To address this, we propose Coarse-to-Fine Autoregressive Policy (CARP), a two-stage action generation framework: first, a hierarchical representation is learned via an action autoencoder with multi-scale sequence modeling; second, a GPT-style Transformer progressively refines predictions in an autoregressive manner. CARP introduces the first “coarse-to-fine autoregressive” paradigm, preserving diffusion-level accuracy while significantly improving inference efficiency and task generalization. Experiments demonstrate that CARP achieves state-of-the-art performance on both simulated and real-robot tasks—improving success rates by up to 10% and accelerating inference by 10×—thereby effectively resolving the long-standing accuracy–speed–generalization trade-off.

📝 Abstract
In robotic visuomotor policy learning, diffusion-based models have achieved significant success in improving the accuracy of action trajectory generation compared to traditional autoregressive models. However, they suffer from inefficiency due to multiple denoising steps and limited flexibility from complex constraints. In this paper, we introduce Coarse-to-Fine AutoRegressive Policy (CARP), a novel paradigm for visuomotor policy learning that redefines the autoregressive action generation process as a coarse-to-fine, next-scale approach. CARP decouples action generation into two stages: first, an action autoencoder learns multi-scale representations of the entire action sequence; then, a GPT-style transformer refines the sequence prediction through a coarse-to-fine autoregressive process. This straightforward and intuitive approach produces highly accurate and smooth actions, matching or even surpassing the performance of diffusion-based policies while maintaining efficiency on par with autoregressive policies. We conduct extensive evaluations across diverse settings, including single-task and multi-task scenarios on state-based and image-based simulation benchmarks, as well as real-world tasks. CARP achieves competitive success rates, with up to a 10% improvement, and delivers 10x faster inference compared to state-of-the-art policies, establishing a high-performance, efficient, and flexible paradigm for action generation in robotic tasks.
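The abstract's two-stage inference idea — autoregressing over scales rather than over timesteps — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: `SCALES`, `predict_next_scale`, and the additive refinement are assumptions standing in for the learned transformer and discrete token maps.

```python
import numpy as np

SCALES = [1, 2, 4, 8]           # token-map lengths, coarsest to finest (assumed)
ACTION_DIM = 2                  # e.g. (x, y) end-effector deltas

def upsample(tokens, new_len):
    """Nearest-neighbor upsampling of a coarse token map to a finer scale."""
    idx = np.floor(np.linspace(0, len(tokens) - 1e-9, new_len)).astype(int)
    return tokens[idx]

def predict_next_scale(context, new_len):
    """Stand-in for the GPT-style transformer: a real model would sample
    discrete tokens conditioned on all coarser scales; here we just add a
    small deterministic residual to the upsampled context."""
    coarse = upsample(context, new_len)
    return coarse + 0.1 * np.arange(new_len)[:, None]

def generate(obs_embedding):
    """Coarse-to-fine generation: one forward pass per scale, so the number
    of passes grows with len(SCALES), not with the action-sequence length."""
    tokens = obs_embedding[None, :]             # coarsest scale: one token
    for s in SCALES[1:]:
        tokens = predict_next_scale(tokens, s)  # refine to the next scale
    return tokens                               # finest scale ≈ action chunk

actions = generate(np.zeros(ACTION_DIM))
print(actions.shape)  # (8, 2)
```

This is where the claimed speedup comes from: a timestep-autoregressive policy needs one pass per action step, while the next-scale scheme above emits the whole chunk in `len(SCALES) - 1` passes.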
Problem

Research questions and friction points this paper is trying to address.

Improves action trajectory generation accuracy in robotics
Overcomes inefficiency of diffusion-based models in policy learning
Enhances flexibility and speed in visuomotor action prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Coarse-to-fine autoregressive action generation
Multi-scale action sequence autoencoder
GPT-style transformer for refinement
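One plausible reading of the multi-scale action autoencoder is a residual decomposition across scales: each scale stores a downsampled residual of what the coarser scales have not yet explained. The sketch below is a hedged illustration under that assumption; the function names and nearest-neighbor resampling are not from the paper.

```python
import numpy as np

def resample(x, new_len):
    """Nearest-neighbor resampling along the time axis."""
    idx = np.floor(np.linspace(0, len(x) - 1e-9, new_len)).astype(int)
    return x[idx]

def encode_multiscale(actions, scales):
    """Decompose an action sequence into coarse-to-fine residual maps,
    accumulating an upsampled reconstruction as each scale is added."""
    recon = np.zeros_like(actions)
    maps = []
    for s in scales:
        residual = resample(actions - recon, s)           # what is still missing
        maps.append(residual)
        recon = recon + resample(residual, len(actions))  # add it back, upsampled
    return maps, recon

rng = np.random.default_rng(0)
actions = rng.normal(size=(8, 2))          # an 8-step, 2-DoF action chunk
maps, recon = encode_multiscale(actions, scales=[1, 2, 4, 8])
# The finest scale matches the sequence length, so reconstruction is exact here.
print(np.allclose(recon, actions))  # True
```

In a learned autoencoder the residual maps would be quantized into discrete tokens, giving the transformer a coarse-to-fine vocabulary to predict over.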
Zhefei Gong
Unknown affiliation
Pengxiang Ding
Zhejiang University
Human Motion Prediction, Large Language Model, Embodied AI
Shangke Lyu
Westlake University
Robot Control, Learning Control, Human-robot Interaction
Siteng Huang
Alibaba DAMO Academy | ZJU | Westlake University
Vision-language Models, Generative Models, Embodied AI
Mingyang Sun
Westlake University, Zhejiang University
Wei Zhao
Westlake University
Zhaoxin Fan
Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing
Donglin Wang
Westlake University