Mimic Intent, Not Just Trajectories

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing imitation learning methods generalize poorly under environmental changes and in skill-transfer scenarios because they merely reproduce observed trajectories without capturing the underlying behavioral intent. This work proposes an end-to-end framework that explicitly decouples intent from execution details through multi-scale frequency-domain tokenization, establishing a hierarchical intent–execution structure. The model generates actions by progressive autoregression from a coarse-grained intent token down to fine-grained execution tokens. Notably, it supports one-shot skill transfer by injecting only the high-level intent token from a demonstration. Evaluated across multiple simulated manipulation benchmarks and on a real robot, the method achieves state-of-the-art success rates while improving robustness to perturbations, inference efficiency, and transfer performance.

📝 Abstract
While imitation learning (IL) has achieved impressive success in dexterous manipulation through generative modeling and pretraining, state-of-the-art approaches like Vision-Language-Action (VLA) models still struggle with adaptation to environmental changes and skill transfer. We argue this stems from mimicking raw trajectories without understanding the underlying intent. To address this, we propose explicitly disentangling behavior intent from execution details in end-to-end IL: \textit{``Mimic Intent, Not just Trajectories'' (MINT)}. We achieve this via \textit{multi-scale frequency-space tokenization}, which enforces a spectral decomposition of the action chunk representation. We learn action tokens with a multi-scale coarse-to-fine structure, and force the coarsest token to capture low-frequency global structure and finer tokens to encode high-frequency details. This yields an abstract \textit{Intent token} that facilitates planning and transfer, and multi-scale \textit{Execution tokens} that enable precise adaptation to environmental dynamics. Building on this hierarchy, our policy generates trajectories through \textit{next-scale autoregression}, performing progressive \textit{intent-to-execution reasoning}, thus boosting learning efficiency and generalization. Crucially, this disentanglement enables \textit{one-shot transfer} of skills, by simply injecting the Intent token from a demonstration into the autoregressive generation process. Experiments on several manipulation benchmarks and on a real robot demonstrate state-of-the-art success rates, superior inference efficiency, robust generalization against disturbances, and effective one-shot transfer.
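As a rough illustration of the spectral-decomposition idea (not the paper's actual tokenizer, which learns discrete tokens end-to-end), a DCT of an action chunk can be split into coarse-to-fine frequency bands: the lowest band captures the global shape of the motion, standing in for the Intent token, while higher bands carry fine execution detail. The function names, band sizes, and numpy-only DCT below are illustrative assumptions.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis: row k is the k-th cosine basis vector,
    # so M @ M.T == I and the inverse transform is simply M.T.
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    M = np.cos(np.pi * (2 * t + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    M[0] *= 1.0 / np.sqrt(2.0)
    return M

def multiscale_bands(chunk, scales=(1, 4, 16)):
    """Split an action chunk (T x D) into coarse-to-fine frequency bands.

    `scales` gives the cumulative number of DCT coefficients kept at each
    level; the coarsest band plays the role of an 'Intent token' and the
    later bands play the role of 'Execution tokens' (illustrative only).
    """
    T, _ = chunk.shape
    coeffs = dct_matrix(T) @ chunk        # full spectrum, (T x D)
    bands, prev = [], 0
    for s in scales:
        bands.append(coeffs[prev:s])      # only the newly added coefficients
        prev = s
    return bands

def reconstruct(bands, T):
    """Inverse transform from a (possibly zeroed) stack of bands."""
    coeffs = np.zeros((T, bands[0].shape[1]))
    prev = 0
    for band in bands:
        coeffs[prev:prev + len(band)] = band
        prev += len(band)
    return dct_matrix(T).T @ coeffs
```

Zeroing the finer bands before `reconstruct` yields a smoothed, low-frequency version of the trajectory, which is the intuition behind letting the coarsest token carry global structure.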
Problem

Research questions and friction points this paper is trying to address.

imitation learning
skill transfer
environmental adaptation
behavior intent
trajectory mimicry
Innovation

Methods, ideas, or system contributions that make the work stand out.

intent disentanglement
multi-scale tokenization
frequency-space representation
one-shot skill transfer
autoregressive action generation
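The one-shot transfer mechanism can be sketched as control flow: instead of generating the coarsest token, the policy is seeded with the Intent token extracted from a demonstration, and only the finer Execution tokens are generated autoregressively. The `predict_next_scale` interface below is a hypothetical stand-in for the learned policy head, not the paper's API.

```python
from typing import Callable, List
import numpy as np

def generate_with_intent(
    predict_next_scale: Callable[[List[np.ndarray]], np.ndarray],
    intent_token: np.ndarray,
    num_scales: int,
) -> List[np.ndarray]:
    """Next-scale autoregression with an injected Intent token.

    predict_next_scale is a placeholder for the policy head: given the
    tokens generated so far (coarse to fine), it returns the next, finer
    band. Fixing the first element to a demonstration's intent_token pins
    the low-frequency plan while finer execution detail is generated anew.
    """
    tokens = [intent_token]                 # scale 0: injected, not generated
    for _ in range(num_scales - 1):
        tokens.append(predict_next_scale(tokens))
    return tokens
```

In ordinary inference the first token would also come from the policy; one-shot transfer differs only in where scale 0 originates.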
Renming Huang
Shanghai Jiao Tong University, Shanghai Innovation Institute
Chendong Zeng
Shanghai Jiao Tong University, Shanghai Innovation Institute
Wenjing Tang
Shanghai Jiao Tong University
Jingtian Cai
Shanghai Jiao Tong University, Shanghai Innovation Institute
Cewu Lu
Shanghai Jiao Tong University, Shanghai Innovation Institute
Panpan Cai
Shanghai Jiao Tong University, Shanghai Innovation Institute