Imit Diff: Semantics Guided Diffusion Transformer with Dual Resolution Fusion for Imitation Learning

📅 2025-02-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the weak skill generalization of vision-based imitation learning in complex scenes with visual distractions, this paper proposes a semantics-guided, dual-resolution diffusion Transformer framework. Methodologically, it introduces (1) an explicit mapping mechanism from high-level semantic instructions to pixel-level visual grounding; (2) a dual-resolution encoder coupled with a consistency-driven diffusion Transformer, jointly improving robustness, real-time inference, and motion smoothness; and (3) the integration of vision-language foundation models, multi-scale visual enhancement, and cross-resolution feature consistency modeling. Evaluated on multiple real-world robotic manipulation and navigation tasks, the framework achieves significant improvements over state-of-the-art methods, particularly under strong visual occlusion and zero-shot category-generalization settings, demonstrating superior generalization, adaptability, and deployment feasibility.

📝 Abstract
Visuomotor imitation learning enables embodied agents to effectively acquire manipulation skills from video demonstrations and robot proprioception. However, as scene complexity and visual distractions increase, existing methods that perform well in simple scenes tend to degrade in performance. To address this challenge, we introduce Imit Diff, a semantics-guided diffusion transformer with dual resolution fusion for imitation learning. Our approach leverages prior knowledge from vision-language foundation models to translate high-level semantic instructions into pixel-level visual localization. This information is explicitly integrated into a multi-scale visual enhancement framework, constructed with a dual resolution encoder. Additionally, we introduce an implementation of Consistency Policy within the diffusion transformer architecture to improve both real-time performance and motion smoothness in embodied agent control. We evaluate Imit Diff on several challenging real-world tasks. Due to its task-oriented visual localization and fine-grained scene perception, it significantly outperforms state-of-the-art methods, especially in complex scenes with visual distractions, including zero-shot experiments focused on visual distraction and category generalization. The code will be made publicly available.
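The paper itself does not include code on this page; as a rough illustration of what the dual-resolution fusion described in the abstract might look like, the following NumPy sketch (all function names hypothetical, not the authors' implementation) encodes a full-resolution image and a downsampled copy, aligns the high-resolution features to the low-resolution grid, and concatenates the two streams channel-wise:

```python
import numpy as np

def avg_pool2d(x, k):
    """Average-pool an (H, W, C) feature map by an integer factor k."""
    h, w, c = x.shape
    return x.reshape(h // k, k, w // k, k, c).mean(axis=(1, 3))

def dual_resolution_fuse(img, encoder_hi, encoder_lo, k=4):
    """Hypothetical fusion step: encode the full-resolution image and a
    k-times downsampled copy, pool the high-res features onto the
    low-res grid, then concatenate along the channel axis."""
    feat_hi = encoder_hi(img)                  # (H, W, C1)
    feat_lo = encoder_lo(avg_pool2d(img, k))   # (H/k, W/k, C2)
    feat_hi_aligned = avg_pool2d(feat_hi, k)   # (H/k, W/k, C1)
    return np.concatenate([feat_hi_aligned, feat_lo], axis=-1)

# Toy stand-ins for the two encoders (identity feature extractors).
img = np.random.rand(32, 32, 3)
fused = dual_resolution_fuse(img, lambda x: x, lambda x: x, k=4)
print(fused.shape)  # (8, 8, 6)
```

In the actual method the two branches would be learned encoders of different capacities, and the fused features would condition the diffusion Transformer; this sketch only shows the resolution-alignment bookkeeping.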
Problem

Research questions and friction points this paper is trying to address.

Performance of existing visuomotor policies degrades as scene complexity and visual distractions increase
Visual encoders lack task-oriented localization and fine-grained scene perception
Diffusion-based policies struggle with real-time inference speed and motion smoothness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantics-guided diffusion Transformer leveraging vision-language foundation models
Dual-resolution fusion for multi-scale visual enhancement
Consistency Policy implementation for real-time, smooth control
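The Consistency Policy idea referenced above distills an iterative diffusion sampler into a model that maps a noisy action directly to a clean action in a single evaluation, which is what makes real-time control feasible. A toy NumPy sketch of the contrast (all models hypothetical, not the paper's implementation):

```python
import numpy as np

def multi_step_denoise(model, x_T, n_steps=50):
    """Standard diffusion-style sampling: iteratively refine a noisy
    action with many network evaluations (toy Euler updates)."""
    x = x_T
    for t in np.linspace(1.0, 0.0, n_steps):
        x = x + model(x, t) / n_steps
    return x

def consistency_step(consistency_model, x_T):
    """Consistency sampling: a single function evaluation maps
    pure noise directly to a clean action."""
    return consistency_model(x_T, 1.0)

# Toy models: the "true" clean action is the zero vector.
denoiser = lambda x, t: -x          # drift toward zero
consistency = lambda x, t: x * 0.0  # directly outputs the clean action

x_T = np.random.randn(7)            # e.g. a 7-DoF action sample
a_multi = multi_step_denoise(denoiser, x_T)   # 50 evaluations
a_one = consistency_step(consistency, x_T)    # 1 evaluation
```

The one-evaluation path is why the abstract can claim improved real-time performance; the paper additionally targets motion smoothness, which this toy example does not model.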
Yuhang Dong (Zhejiang University)
Haizhou Ge (Tsinghua University)
Yupei Zeng (Zhejiang University)
Jiangning Zhang (Youtu Lab, Tencent)
Beiwen Tian (Tsinghua University)
Guanzhong Tian (Ningbo Research Institute, Zhejiang University)
Hongrui Zhu (Zhejiang University)
Yufei Jia (Tsinghua University)
Ruixiang Wang (Harbin Institute of Technology, Weihai)
Ran Yi (Associate Professor, Shanghai Jiao Tong University)
Guyue Zhou (Tsinghua University)
Longhua Ma (Zhejiang University)