🤖 AI Summary
To address unreliable control and slow inference in visuomotor imitation learning for complex non-Markovian tasks, this paper proposes Masked Generative Policy (MGP), a dual-paradigm masked generation and refinement framework. MGP comprises two complementary variants: MGP-Short performs parallel token generation with score-guided iterative refinement of low-confidence tokens, while MGP-Long performs single-shot global trajectory prediction coupled with observation-driven dynamic regeneration. The framework rests on three key components: (1) discrete action tokenization, (2) conditional masked Transformer modeling, and (3) observation-adaptive refinement, which together combine globally coherent trajectory prediction with adaptation to changing environments. Evaluated on 150 Meta-World and LIBERO tasks, MGP improves the average success rate by 9% while cutting per-sequence inference time by up to 35×. In dynamic and missing-observation environments, it improves the average success rate by 60%, and it solves two non-Markovian manipulation scenarios where prior state-of-the-art methods fail.
📝 Abstract
We present Masked Generative Policy (MGP), a novel framework for visuomotor imitation learning. We represent actions as discrete tokens and train a conditional masked transformer that generates tokens in parallel and then rapidly refines only low-confidence tokens. We further propose two new sampling paradigms: MGP-Short, which performs parallel masked generation with score-based refinement for Markovian tasks, and MGP-Long, which predicts full trajectories in a single pass and dynamically refines low-confidence action tokens based on new observations. With globally coherent prediction and robust adaptive execution, MGP-Long enables reliable control on complex non-Markovian tasks that prior methods struggle with. Extensive evaluations on 150 robotic manipulation tasks spanning the Meta-World and LIBERO benchmarks show that MGP achieves both rapid inference and higher success rates than state-of-the-art diffusion and autoregressive policies. Specifically, MGP increases the average success rate by 9% across the 150 tasks while cutting per-sequence inference time by up to 35×. It further improves the average success rate by 60% in dynamic and missing-observation environments, and solves two non-Markovian scenarios where other state-of-the-art methods fail.
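The abstract describes the MGP-Short decoding loop only at a high level: predict all masked action tokens in parallel, commit the high-confidence predictions, and re-mask and regenerate the rest. The sketch below illustrates that loop under stated assumptions; the function names (`toy_policy_logits`, `mgp_short_decode`), the `keep_frac` schedule, and the random stand-in for the trained conditional masked transformer are all illustrative, not the paper's implementation.

```python
import numpy as np

VOCAB, SEQ_LEN, MASK = 16, 8, -1  # toy action-token vocabulary; MASK marks unfilled slots

def toy_policy_logits(tokens, obs):
    """Stand-in for the conditional masked transformer: returns per-position
    logits over the action vocabulary. Here it is random noise seeded from
    the observation and current tokens, purely to keep the sketch runnable."""
    seed = abs(hash((obs, tuple(int(t) for t in tokens)))) % (2**32)
    return np.random.default_rng(seed).normal(size=(SEQ_LEN, VOCAB))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mgp_short_decode(obs, steps=4, keep_frac=0.5):
    """Parallel masked generation with score-based refinement: each round
    predicts every masked position at once, commits the most confident
    predictions, and leaves the rest masked for the next round."""
    tokens = np.full(SEQ_LEN, MASK)
    for _ in range(steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        probs = softmax(toy_policy_logits(tokens, obs))
        pred = probs.argmax(axis=-1)   # most likely token per position
        conf = probs.max(axis=-1)      # its probability = confidence score
        # commit the top keep_frac of still-masked positions by confidence
        n_keep = max(1, int(np.ceil(keep_frac * masked.size)))
        keep = masked[np.argsort(conf[masked])[-n_keep:]]
        tokens[keep] = pred[keep]
    return tokens

action_tokens = mgp_short_decode(obs=7)
print(action_tokens)
```

With `keep_frac=0.5` and a sequence of 8 tokens, four rounds suffice to fill every slot (8 → 4 → 2 → 1 → 0 masked positions), which is the source of the speedup over strictly token-by-token autoregressive decoding.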