Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback

📅 2025-07-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address imprecise spatial control in text-to-image diffusion models—in particular, the lack of supervision over intermediate denoising steps—this paper proposes InnerControl, a feedback mechanism that aligns intermediate features throughout the entire diffusion process. Built on the ControlNet architecture, InnerControl trains lightweight convolutional probes that reconstruct control signals (e.g., edges, depth) from the UNet's latent features at every denoising step, yielding pseudo ground-truth conditions even from highly noisy latents. An alignment loss then minimizes the discrepancy between these reconstructed conditions and the target controls across all steps. Unlike prior approaches such as ControlNet++, which supervise only the final denoising steps via a cycle-consistency loss, InnerControl improves both control fidelity and generation quality, achieving state-of-the-art results on diverse conditional generation tasks, including edge- and depth-guided synthesis, while remaining computationally efficient.

📝 Abstract
Despite significant progress in text-to-image diffusion models, achieving precise spatial control over generated outputs remains challenging. ControlNet addresses this by introducing an auxiliary conditioning module, while ControlNet++ further refines alignment through a cycle consistency loss applied only to the final denoising steps. However, this approach neglects intermediate generation stages, limiting its effectiveness. We propose InnerControl, a training strategy that enforces spatial consistency across all diffusion steps. Our method trains lightweight convolutional probes to reconstruct input control signals (e.g., edges, depth) from intermediate UNet features at every denoising step. These probes efficiently extract signals even from highly noisy latents, enabling pseudo ground truth controls for training. By minimizing the discrepancy between predicted and target conditions throughout the entire diffusion process, our alignment loss improves both control fidelity and generation quality. Combined with established techniques like ControlNet++, InnerControl achieves state-of-the-art performance across diverse conditioning methods (e.g., edges, depth).
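The core contrast in the abstract—supervising control alignment at every denoising step versus only the final steps—can be illustrated with a minimal sketch. This is not the authors' implementation; the function name and toy discrepancy values are hypothetical, standing in for per-timestep differences between reconstructed and target control signals.

```python
def total_alignment_loss(per_step_discrepancy, all_steps=True, final_k=1):
    """Aggregate per-timestep control discrepancies into one training loss.

    all_steps=True mimics InnerControl-style supervision over every
    denoising step; all_steps=False mimics ControlNet++-style supervision
    restricted to the last `final_k` (nearly noise-free) steps.
    """
    steps = per_step_discrepancy if all_steps else per_step_discrepancy[-final_k:]
    return sum(steps) / len(steps)

# Toy discrepancies from the noisiest timestep to the cleanest.
d = [0.9, 0.7, 0.4, 0.1]
loss_all = total_alignment_loss(d)                    # averages all four steps
loss_final = total_alignment_loss(d, all_steps=False) # only the final step
```

Supervising all steps makes early, noisy latents contribute gradient signal, which is the paper's motivation for probes that can decode controls even from heavily noised features.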
Problem

Research questions and friction points this paper is trying to address.

Achieving precise spatial control in text-to-image diffusion models
Neglecting intermediate generation stages limits alignment effectiveness
Improving control fidelity and generation quality across diffusion steps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enforces spatial consistency across all diffusion steps
Uses lightweight convolutional probes for signal reconstruction
Minimizes discrepancy throughout entire diffusion process
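The probe-and-loss mechanism above can be sketched in a few lines of NumPy. This is a simplified stand-in, not the paper's architecture: a single 1x1 convolution (a per-pixel linear map over channels) replaces the lightweight multi-layer convolutional probes, and all shapes and names here are illustrative assumptions.

```python
import numpy as np

def conv1x1_probe(features, weights, bias):
    """A 1x1 convolution: project C-channel UNet features (C, H, W)
    down to a single-channel control map (H, W)."""
    return np.tensordot(weights, features, axes=([0], [0])) + bias

def alignment_loss(predicted, target):
    """Mean squared discrepancy between the probe's reconstructed
    control map and the target condition (e.g., an edge or depth map)."""
    return float(np.mean((predicted - target) ** 2))

# Toy example: a 4-channel 8x8 intermediate feature map, random probe weights.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8, 8))
w, b = rng.standard_normal(4), 0.0

pred = conv1x1_probe(feats, w, b)            # (8, 8) reconstructed control map
loss = alignment_loss(pred, np.zeros((8, 8)))
```

In training, this loss would be computed at every denoising timestep and backpropagated into the ControlNet branch, so spatial consistency is enforced throughout the whole trajectory rather than only at the end.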