Imit Diff: Semantics Guided Diffusion Transformer with Dual Resolution Fusion for Imitation Learning

📅 2025-02-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the weak skill generalization of vision-based imitation learning in complex scenes with visual distractions, this paper proposes a semantics-guided, dual-resolution diffusion Transformer framework. Methodologically, it introduces (1) an explicit mapping mechanism from high-level semantic instructions to pixel-level visual grounding; (2) a dual-resolution encoder coupled with a consistency-driven diffusion Transformer, jointly improving robustness, real-time inference, and motion smoothness; and (3) the integration of vision-language foundation models, multi-scale visual enhancement, and cross-resolution feature consistency modeling. Evaluated on multiple real-world robotic manipulation and navigation tasks, the framework achieves significant improvements over state-of-the-art methods, particularly under strong visual occlusion and zero-shot category-generalization settings, demonstrating superior generalization, adaptability, and deployment feasibility.

📝 Abstract
Visuomotor imitation learning enables embodied agents to effectively acquire manipulation skills from video demonstrations and robot proprioception. However, as scene complexity and visual distractions increase, existing methods that perform well in simple scenes tend to degrade in performance. To address this challenge, we introduce Imit Diff, a semantics-guided diffusion transformer with dual resolution fusion for imitation learning. Our approach leverages prior knowledge from vision-language foundation models to translate high-level semantic instructions into pixel-level visual localization. This information is explicitly integrated into a multi-scale visual enhancement framework, constructed with a dual resolution encoder. Additionally, we introduce an implementation of Consistency Policy within the diffusion transformer architecture to improve both real-time performance and motion smoothness in embodied agent control. We evaluate Imit Diff on several challenging real-world tasks. Due to its task-oriented visual localization and fine-grained scene perception, it significantly outperforms state-of-the-art methods, especially in complex scenes with visual distractions, including zero-shot experiments focused on visual distraction and category generalization. The code will be made publicly available.
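The paper itself does not include code on this page; as a rough illustration of what the dual-resolution fusion described in the abstract might look like, the following NumPy sketch (all function names hypothetical, not the authors' implementation) encodes a full-resolution image and a downsampled copy, aligns the high-resolution features to the low-resolution grid, and concatenates the two streams channel-wise:

```python
import numpy as np

def avg_pool2d(x, k):
    """Average-pool an (H, W, C) feature map by an integer factor k."""
    h, w, c = x.shape
    return x.reshape(h // k, k, w // k, k, c).mean(axis=(1, 3))

def dual_resolution_fuse(img, encoder_hi, encoder_lo, k=4):
    """Hypothetical fusion step: encode the full-resolution image and a
    k-times downsampled copy, pool the high-res features onto the
    low-res grid, then concatenate along the channel axis."""
    feat_hi = encoder_hi(img)                  # (H, W, C1)
    feat_lo = encoder_lo(avg_pool2d(img, k))   # (H/k, W/k, C2)
    feat_hi_aligned = avg_pool2d(feat_hi, k)   # (H/k, W/k, C1)
    return np.concatenate([feat_hi_aligned, feat_lo], axis=-1)

# Toy stand-ins for the two encoders (identity feature extractors).
img = np.random.rand(32, 32, 3)
fused = dual_resolution_fuse(img, lambda x: x, lambda x: x, k=4)
print(fused.shape)  # (8, 8, 6)
```

In the actual method the two branches would be learned encoders of different capacities, and the fused features would condition the diffusion Transformer; this sketch only shows the resolution-alignment bookkeeping.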
Problem

Research questions and friction points this paper is trying to address.

Performance of existing visuomotor policies degrades as scene complexity and visual distractions increase
Visual encoders lack task-oriented localization and fine-grained scene perception
Diffusion-based policies struggle with real-time inference speed and motion smoothness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantics-guided diffusion Transformer leveraging vision-language foundation models
Dual-resolution fusion for multi-scale visual enhancement
Consistency Policy implementation for real-time, smooth control
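The Consistency Policy idea referenced above distills an iterative diffusion sampler into a model that maps a noisy action directly to a clean action in a single evaluation, which is what makes real-time control feasible. A toy NumPy sketch of the contrast (all models hypothetical, not the paper's implementation):

```python
import numpy as np

def multi_step_denoise(model, x_T, n_steps=50):
    """Standard diffusion-style sampling: iteratively refine a noisy
    action with many network evaluations (toy Euler updates)."""
    x = x_T
    for t in np.linspace(1.0, 0.0, n_steps):
        x = x + model(x, t) / n_steps
    return x

def consistency_step(consistency_model, x_T):
    """Consistency sampling: a single function evaluation maps
    pure noise directly to a clean action."""
    return consistency_model(x_T, 1.0)

# Toy models: the "true" clean action is the zero vector.
denoiser = lambda x, t: -x          # drift toward zero
consistency = lambda x, t: x * 0.0  # directly outputs the clean action

x_T = np.random.randn(7)            # e.g. a 7-DoF action sample
a_multi = multi_step_denoise(denoiser, x_T)   # 50 evaluations
a_one = consistency_step(consistency, x_T)    # 1 evaluation
```

The one-evaluation path is why the abstract can claim improved real-time performance; the paper additionally targets motion smoothness, which this toy example does not model.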
Yuhang Dong (Zhejiang University)
Haizhou Ge (Tsinghua University)
Yupei Zeng (Zhejiang University)
Jiangning Zhang (Youtu Lab, Tencent)
Beiwen Tian (Tsinghua University)
Guanzhong Tian (Ningbo Research Institute, Zhejiang University)
Hongrui Zhu (Zhejiang University)
Yufei Jia (Tsinghua University)
Ruixiang Wang (Harbin Institute of Technology, Weihai)
Ran Yi (Associate Professor, Shanghai Jiao Tong University)
Guyue Zhou (Tsinghua University)
Longhua Ma (Zhejiang University)