Underlying Semantic Diffusion for Effective and Efficient In-Context Learning

📅 2025-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address three key limitations of diffusion models in multi-task visual generation—weak underlying semantic representation, poor in-context learning capability, and low inference efficiency—this paper proposes an underlying-semantic-enhanced diffusion framework (US-Diffusion). Methodologically, it introduces a *Separate&Gather Adapter (SGA)* to decouple task-specific conditioning while sharing the backbone network; a *Feedback-Aided Learning (FAL)* mechanism that refines the denoising process via semantic feedback signals; and a plug-and-play *Efficient Sampling Strategy (ESS)* that samples densely at high-noise time steps to accelerate training and inference. Integrated with multi-task joint training and in-context fine-tuning, the framework achieves substantial improvements: an average 7.47 reduction in FID on Map2Image, an average 0.026 decrease in RMSE on Image2Map, and roughly 9.45× faster inference. It also markedly enhances cross-domain generalization and adaptability to unseen tasks.

📝 Abstract
Diffusion models have emerged as a powerful framework for tasks like controllable image generation and dense prediction. However, existing models often struggle to capture underlying semantics (e.g., edges, textures, shapes) and effectively utilize in-context learning, limiting their contextual understanding and image generation quality. Additionally, high computational costs and slow inference speeds hinder their real-time applicability. To address these challenges, we propose Underlying Semantic Diffusion (US-Diffusion), an enhanced diffusion model that boosts underlying semantics learning, computational efficiency, and in-context learning capabilities in multi-task scenarios. We introduce the Separate&Gather Adapter (SGA), which decouples input conditions for different tasks while sharing the architecture, enabling better in-context learning and generalization across diverse visual domains. We also present a Feedback-Aided Learning (FAL) framework, which leverages feedback signals to guide the model in capturing semantic details and dynamically adapting to task-specific contextual cues. Furthermore, we propose a plug-and-play Efficient Sampling Strategy (ESS) for dense sampling at time steps with high noise levels, which aims at optimizing training and inference efficiency while maintaining strong in-context learning performance. Experimental results demonstrate that US-Diffusion outperforms the state-of-the-art method, achieving an average reduction of 7.47 in FID on Map2Image tasks and an average reduction of 0.026 in RMSE on Image2Map tasks, while achieving approximately 9.45 times faster inference speed. Our method also demonstrates superior training efficiency and in-context learning capabilities, excelling on new datasets and tasks, highlighting its robustness and adaptability across diverse visual domains.
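The abstract describes ESS only at a high level: sample training/inference time steps more densely where noise is high. A minimal sketch of that idea, assuming a simple power-law weighting over time steps (the function name, `bias` parameter, and weighting scheme are illustrative assumptions, not the authors' implementation):

```python
import random

def ess_timesteps(num_train_steps=1000, batch=8, bias=2.0, rng=None):
    """Draw diffusion time steps with density increasing toward
    high-noise (large-t) steps, in the spirit of dense sampling
    at high noise levels.

    bias > 0 skews sampling toward later (noisier) time steps;
    bias = 0 recovers uniform sampling.
    """
    rng = rng or random.Random()
    steps = list(range(num_train_steps))
    # Weight each step t by (t + 1) ** bias, so high-noise steps
    # are drawn more often than low-noise ones.
    weights = [(t + 1) ** bias for t in steps]
    return rng.choices(steps, weights=weights, k=batch)
```

With `bias=2.0`, the expected sampled step sits near t ≈ 750 of 1000, rather than the uniform 500, so most of the sampling budget lands in the high-noise regime the paper targets.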
Problem

Research questions and friction points this paper is trying to address.

Weak underlying semantic learning (edges, textures, shapes) in existing diffusion models.
High computational cost and slow inference speed, limiting real-time use.
Limited in-context learning and generalization across tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhanced diffusion model (US-Diffusion) for underlying semantic learning
Separate&Gather Adapter (SGA) for task decoupling over a shared backbone
Feedback-Aided Learning (FAL) for semantic-feedback-guided denoising
Efficient Sampling Strategy (ESS) for dense sampling at high-noise time steps
👥 Authors
Zhong Ji
Tianjin University
Multimedia understanding, cross-modal learning, zero/few-shot learning
Weilong Cao
School of Electrical and Information Engineering, Tianjin Key Laboratory of Brain-inspired Intelligence Technology, Tianjin University, Tianjin 300072, China
Yan Zhang
School of Electrical and Information Engineering, Tianjin Key Laboratory of Brain-inspired Intelligence Technology, Tianjin University, Tianjin 300072, China
Yanwei Pang
Tianjin University
Computer vision, image processing, pattern recognition, machine learning
Jungong Han
Chair Professor in Computer Vision, University of Sheffield, UK, FIAPR, FAAIA
Computer vision, video analytics, machine learning
Xuelong Li
Institute of Artificial Intelligence (TeleAI), China Telecom Corp Ltd, 31 Jinrong Street, Beijing 100033, China