Mimic Intent, Not Just Trajectories

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing imitation learning methods generalize poorly under environmental changes and in skill-transfer scenarios because they merely reproduce observed trajectories without capturing the underlying behavioral intent. This work proposes an end-to-end framework that explicitly decouples intent from execution details through multi-scale frequency-domain tokenization, establishing a hierarchical intent–execution structure. The model generates actions by progressive autoregression from a coarse-grained intent token down to fine-grained execution tokens. Notably, it supports one-shot skill transfer by injecting only the high-level intent token from a demonstration. Evaluated across multiple simulated manipulation benchmarks and on a real robot, the method achieves state-of-the-art success rates while improving robustness to perturbations, inference efficiency, and transfer performance.

📝 Abstract
While imitation learning (IL) has achieved impressive success in dexterous manipulation through generative modeling and pretraining, state-of-the-art approaches like Vision-Language-Action (VLA) models still struggle with adaptation to environmental changes and skill transfer. We argue this stems from mimicking raw trajectories without understanding the underlying intent. To address this, we propose explicitly disentangling behavior intent from execution details in end-to-end IL: \textit{``Mimic Intent, Not just Trajectories'' (MINT)}. We achieve this via \textit{multi-scale frequency-space tokenization}, which enforces a spectral decomposition of the action chunk representation. We learn action tokens with a multi-scale coarse-to-fine structure, and force the coarsest token to capture low-frequency global structure and finer tokens to encode high-frequency details. This yields an abstract \textit{Intent token} that facilitates planning and transfer, and multi-scale \textit{Execution tokens} that enable precise adaptation to environmental dynamics. Building on this hierarchy, our policy generates trajectories through \textit{next-scale autoregression}, performing progressive \textit{intent-to-execution reasoning}, thus boosting learning efficiency and generalization. Crucially, this disentanglement enables \textit{one-shot transfer} of skills, by simply injecting the Intent token from a demonstration into the autoregressive generation process. Experiments on several manipulation benchmarks and on a real robot demonstrate state-of-the-art success rates, superior inference efficiency, robust generalization against disturbances, and effective one-shot transfer.
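As a rough illustration of the spectral-decomposition idea (not the paper's actual tokenizer, which learns discrete tokens end-to-end), a DCT of an action chunk can be split into coarse-to-fine frequency bands: the lowest band captures the global shape of the motion, standing in for the Intent token, while higher bands carry fine execution detail. The function names, band sizes, and numpy-only DCT below are illustrative assumptions.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis: row k is the k-th cosine basis vector,
    # so M @ M.T == I and the inverse transform is simply M.T.
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    M = np.cos(np.pi * (2 * t + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    M[0] *= 1.0 / np.sqrt(2.0)
    return M

def multiscale_bands(chunk, scales=(1, 4, 16)):
    """Split an action chunk (T x D) into coarse-to-fine frequency bands.

    `scales` gives the cumulative number of DCT coefficients kept at each
    level; the coarsest band plays the role of an 'Intent token' and the
    later bands play the role of 'Execution tokens' (illustrative only).
    """
    T, _ = chunk.shape
    coeffs = dct_matrix(T) @ chunk        # full spectrum, (T x D)
    bands, prev = [], 0
    for s in scales:
        bands.append(coeffs[prev:s])      # only the newly added coefficients
        prev = s
    return bands

def reconstruct(bands, T):
    """Inverse transform from a (possibly zeroed) stack of bands."""
    coeffs = np.zeros((T, bands[0].shape[1]))
    prev = 0
    for band in bands:
        coeffs[prev:prev + len(band)] = band
        prev += len(band)
    return dct_matrix(T).T @ coeffs
```

Zeroing the finer bands before `reconstruct` yields a smoothed, low-frequency version of the trajectory, which is the intuition behind letting the coarsest token carry global structure.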
Problem

Research questions and friction points this paper is trying to address.

imitation learning
skill transfer
environmental adaptation
behavior intent
trajectory mimicry
Innovation

Methods, ideas, or system contributions that make the work stand out.

intent disentanglement
multi-scale tokenization
frequency-space representation
one-shot skill transfer
autoregressive action generation
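The one-shot transfer mechanism can be sketched as control flow: instead of generating the coarsest token, the policy is seeded with the Intent token extracted from a demonstration, and only the finer Execution tokens are generated autoregressively. The `predict_next_scale` interface below is a hypothetical stand-in for the learned policy head, not the paper's API.

```python
from typing import Callable, List
import numpy as np

def generate_with_intent(
    predict_next_scale: Callable[[List[np.ndarray]], np.ndarray],
    intent_token: np.ndarray,
    num_scales: int,
) -> List[np.ndarray]:
    """Next-scale autoregression with an injected Intent token.

    predict_next_scale is a placeholder for the policy head: given the
    tokens generated so far (coarse to fine), it returns the next, finer
    band. Fixing the first element to a demonstration's intent_token pins
    the low-frequency plan while finer execution detail is generated anew.
    """
    tokens = [intent_token]                 # scale 0: injected, not generated
    for _ in range(num_scales - 1):
        tokens.append(predict_next_scale(tokens))
    return tokens
```

In ordinary inference the first token would also come from the policy; one-shot transfer differs only in where scale 0 originates.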
Renming Huang
Shanghai Jiao Tong University, Shanghai Innovation Institute
Chendong Zeng
Shanghai Jiao Tong University, Shanghai Innovation Institute
Wenjing Tang
Shanghai Jiao Tong University
Jingtian Cai
Shanghai Jiao Tong University, Shanghai Innovation Institute
Cewu Lu
Shanghai Jiao Tong University, Shanghai Innovation Institute
Panpan Cai
Shanghai Jiao Tong University, Shanghai Innovation Institute