🤖 AI Summary
To address the heavy parameter and computational overhead that ControlNet-style paradigms incur in controllable text-to-image generation with DiT architectures, this paper proposes NanoControl, a lightweight control framework. Methodologically, NanoControl: (1) builds a LoRA-style control module atop the Flux backbone, eliminating redundant backbone duplication; (2) introduces KV-Context Augmentation to inject conditional features deeply yet efficiently into the attention mechanism; and (3) learns control signals directly from raw conditioning inputs, combining low-rank adaptation with KV-Context Augmentation rather than duplicating backbone features. Experiments demonstrate that NanoControl adds only 0.024% extra parameters and 0.029% extra GFLOPs, yet achieves state-of-the-art controllable generation performance across multiple benchmarks, improving both inference efficiency and control fidelity.
📝 Abstract
Diffusion Transformers (DiTs) have demonstrated exceptional capabilities in text-to-image synthesis. However, in the domain of controllable text-to-image generation using DiTs, most existing methods still rely on the ControlNet paradigm originally designed for UNet-based diffusion models. This paradigm introduces significant parameter overhead and increased computational costs. To address these challenges, we propose the Nano Control Diffusion Transformer (NanoControl), which employs Flux as the backbone network. Our model achieves state-of-the-art controllable text-to-image generation performance while incurring only a 0.024% increase in parameter count and a 0.029% increase in GFLOPs, thus enabling highly efficient controllable generation. Specifically, rather than duplicating the DiT backbone for control, we design a LoRA-style (low-rank adaptation) control module that directly learns control signals from raw conditioning inputs. Furthermore, we introduce a KV-Context Augmentation mechanism that integrates condition-specific key-value information into the backbone in a simple yet highly effective manner, facilitating deep fusion of conditional features. Extensive benchmark experiments demonstrate that NanoControl significantly reduces computational overhead compared to conventional control approaches, while maintaining superior generation quality and achieving improved controllability.
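The core idea of KV-Context Augmentation, as described above, can be illustrated with a minimal sketch: a small low-rank (LoRA-style) adapter maps raw conditioning tokens to extra key/value pairs, which are appended to the backbone's own keys and values so that attention can attend to the control signal directly. The shapes, rank, and adapter weights below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

rng = np.random.default_rng(0)
d, n_img, n_cond, r = 16, 8, 4, 2  # hypothetical sizes; r = LoRA rank

# Stand-ins for the frozen backbone's query/key/value tokens.
q, k, v = (rng.standard_normal((n_img, d)) for _ in range(3))

# Hypothetical LoRA-style adapter: rank-r down/up projections turn raw
# conditioning tokens into condition-specific keys and values.
cond = rng.standard_normal((n_cond, d))
A_k, B_k = rng.standard_normal((d, r)), rng.standard_normal((r, d))
A_v, B_v = rng.standard_normal((d, r)), rng.standard_normal((r, d))
k_c = cond @ A_k @ B_k
v_c = cond @ A_v @ B_v

# KV-Context Augmentation: append the condition K/V to the backbone's
# K/V; only the tiny adapters (A_*, B_*) would be trained.
out = attention(q, np.concatenate([k, k_c]), np.concatenate([v, v_c]))
print(out.shape)  # (8, 16)
```

Because only the rank-r adapter matrices are new, the added parameter count scales with 2·r·d per projection rather than with a duplicated backbone, which is consistent with the sub-0.03% overhead the abstract reports.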