OminiControl2: Efficient Conditioning for Diffusion Transformers

📅 2025-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion Transformers (DiTs) suffer from inefficient conditional processing and poor scalability to long conditional inputs in multi-condition image generation. Method: We propose dynamic conditional token compression and a conditional feature reuse mechanism, integrating semantic-aware token pruning, cache-based feature reuse, and a lightweight adapter design. This preserves parameter efficiency and multimodal compatibility while drastically reducing computational overhead. Contribution/Results: Our approach reduces conditional processing FLOPs by over 90% and accelerates multi-condition generation by 5.9×, without compromising image quality or fine-grained controllability. It enables high-fidelity, multimodal-conditioned synthesis with real-time feasibility, establishing a new paradigm for efficient DiT-based multimodal conditional modeling.

📝 Abstract
Fine-grained control of text-to-image diffusion transformer (DiT) models remains a critical challenge for practical deployment. While recent advances such as OminiControl have enabled controllable generation from diverse control signals, these methods face significant computational inefficiency when handling long conditional inputs. We present OminiControl2, an efficient framework for image-conditional image generation. OminiControl2 introduces two key innovations: (1) a dynamic compression strategy that streamlines conditional inputs by preserving only the most semantically relevant tokens during generation, and (2) a conditional feature reuse mechanism that computes condition token features only once and reuses them across denoising steps. These architectural improvements preserve the original framework's parameter efficiency and multi-modal versatility while dramatically reducing computational costs. Our experiments demonstrate that OminiControl2 reduces conditional processing overhead by over 90% compared to its predecessor, achieving an overall 5.9× speedup in multi-conditional generation scenarios. This efficiency enables the practical implementation of complex, multi-modal control for high-quality image synthesis with DiT models.
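The abstract's first innovation, dynamic compression of conditional inputs, amounts to keeping only the top-scoring condition tokens before they enter the transformer. The paper's listing does not include code, so the sketch below is a minimal illustration under assumed shapes; the function name, the `importance` scores, and the `keep_ratio` parameter are placeholders, not the paper's actual API (in practice the scores would be semantic, e.g. derived from attention with the prompt).

```python
import torch

def prune_condition_tokens(cond_tokens: torch.Tensor,
                           importance: torch.Tensor,
                           keep_ratio: float = 0.1) -> torch.Tensor:
    """Keep only the highest-scoring condition tokens.

    cond_tokens: (B, N, D) condition token features.
    importance:  (B, N) per-token relevance score (placeholder assumption;
                 the paper uses a semantic-aware criterion).
    """
    n_tokens = cond_tokens.shape[1]
    k = max(1, int(n_tokens * keep_ratio))
    # Indices of the k most relevant tokens per batch element: (B, k)
    top = torch.topk(importance, k, dim=1).indices
    # Expand indices over the feature dimension so gather keeps full vectors
    idx = top.unsqueeze(-1).expand(-1, -1, cond_tokens.shape[-1])
    return torch.gather(cond_tokens, 1, idx)  # (B, k, D)
```

With `keep_ratio=0.1`, nine out of ten condition tokens are dropped before any transformer block sees them, which is where the bulk of the claimed FLOPs reduction would come from.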
Problem

Research questions and friction points this paper is trying to address.

Efficient fine-grained control of text-to-image diffusion transformers.
Reducing computational inefficiency with long conditional inputs.
Achieving practical multi-modal control for high-quality image synthesis.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic compression strategy for conditional inputs
Conditional feature reuse across denoising steps
Over 90% reduction in conditional processing overhead
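The second bullet, feature reuse across denoising steps, exploits the fact that the condition image is fixed while the latent changes at every timestep, so its features need to be encoded only once. A minimal sketch of that caching idea follows; `encoder` is a hypothetical stand-in for whatever module embeds the condition tokens, not the paper's actual implementation.

```python
class ConditionFeatureCache:
    """Compute condition-token features once; replay them at later steps.

    `encoder` is a placeholder for the condition-embedding module.
    Because the condition input is constant across denoising timesteps,
    the expensive encoder call can be amortized over the whole trajectory.
    """

    def __init__(self, encoder):
        self.encoder = encoder
        self._cache = None

    def features(self, cond_tokens):
        if self._cache is None:
            # First denoising step: run the encoder once.
            self._cache = self.encoder(cond_tokens)
        # All subsequent steps: reuse the cached features.
        return self._cache
```

For a 50-step sampler this turns 50 condition-encoding passes into one, which is consistent with the large speedups the summary reports for multi-condition generation.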