RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation

📅 2025-07-03
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Training-free feature-injection methods for text-to-image diffusion models often suffer from structural misalignment, condition leakage, and visual artifacts when integrating structural conditioning signals (e.g., depth or pose maps), especially when the condition maps diverge significantly from the natural RGB image distribution. To address this, we propose a training-free spatial control framework that decouples the feature-injection timestep from the denoising process. Our method introduces three core components: a structure-rich injection module that integrates structural cues at suitably chosen timesteps; an appearance-rich prompting mechanism that preserves fine-grained visual fidelity; and a restart refinement strategy that iteratively improves structural coherence. The approach achieves strong structural alignment without compromising appearance quality. Evaluated across diverse zero-shot conditioning settings, including depth-, pose-, and edge-guided generation, it establishes new state-of-the-art performance while remaining plug-and-play compatible with off-the-shelf diffusion models.
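The key idea of decoupling the injection timestep from the denoising timestep can be pictured with a minimal sketch (not the authors' code). The callables `denoise_step`, `encode_condition`, and `injection_schedule` below, as well as the example schedule values, are hypothetical stand-ins for the diffusion model's sampling step, the condition-map encoding, and the timing policy:

```python
# Minimal sketch: feature injection whose timestep is chosen independently
# of the current denoising timestep. All callables are hypothetical.

def generate_with_decoupled_injection(
    x_T,                 # initial latent noise
    timesteps,           # denoising schedule, e.g. [T-1, ..., 0]
    denoise_step,        # (latent, t, injected=None) -> latent at t-1
    encode_condition,    # (injection_t) -> structural features for that timestep
    injection_schedule,  # (t) -> injection timestep to use at denoising step t, or None
):
    """Run denoising while the injected condition features come from an
    injection timestep that need not match the current denoising timestep."""
    x = x_T
    for t in timesteps:
        inj_t = injection_schedule(t)
        injected = encode_condition(inj_t) if inj_t is not None else None
        x = denoise_step(x, t, injected=injected)
    return x


# Example policy (an assumption, not a value from the paper): inject
# structure-rich features only during the early, structure-forming steps,
# always encoded at one fixed high-noise timestep.
def example_schedule(t, total_steps=1000, fixed_injection_t=801, cutoff=0.5):
    return fixed_injection_t if t > total_steps * cutoff else None
```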

📝 Abstract
Text-to-image (T2I) diffusion models have shown remarkable success in generating high-quality images from text prompts. Recent efforts extend these models to incorporate conditional images (e.g., depth or pose maps) for fine-grained spatial control. Among them, feature injection methods have emerged as a training-free alternative to traditional fine-tuning approaches. However, they often suffer from structural misalignment, condition leakage, and visual artifacts, especially when the condition image diverges significantly from natural RGB distributions. By revisiting existing methods, we identify a core limitation: the synchronous injection of condition features fails to account for the trade-off between domain alignment and structural preservation during denoising. Inspired by this observation, we propose a flexible feature injection framework that decouples the injection timestep from the denoising process. At its core is a structure-rich injection module, which enables the model to better adapt to the evolving interplay between alignment and structure preservation throughout the diffusion steps, resulting in more faithful structural generation. In addition, we introduce appearance-rich prompting and a restart refinement strategy to further enhance appearance control and visual quality. Together, these designs enable training-free generation that is both structure-rich and appearance-rich. Extensive experiments show that our approach achieves state-of-the-art performance across diverse zero-shot conditioning scenarios.
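The restart refinement strategy mentioned in the abstract can be sketched as a re-noise-then-denoise loop. This is a minimal illustration under assumptions: `add_noise` and `denoise_from` are hypothetical callables, and the restart timestep and number of passes are placeholder values, not ones reported in the paper:

```python
# Minimal sketch of a restart-style refinement loop: after an initial
# generation, re-noise the sample to an intermediate timestep and denoise it
# again so structure can settle over a few passes.

def restart_refinement(x0, add_noise, denoise_from, restart_t=400, num_restarts=2):
    """Iteratively re-noise the current sample to `restart_t` (forward
    diffusion) and denoise it back to t=0 (reverse diffusion)."""
    x = x0
    for _ in range(num_restarts):
        x_t = add_noise(x, restart_t)      # forward diffusion to restart_t
        x = denoise_from(x_t, restart_t)   # reverse diffusion back to t=0
    return x
```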
Problem

Research questions and friction points this paper is trying to address.

Addresses structural misalignment in training-free T2I diffusion models
Mitigates condition leakage and visual artifacts in generated images
Enhances structure and appearance control without fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Flexible feature injection framework that decouples the injection timestep from the denoising process
Structure-rich injection module that adapts to the trade-off between domain alignment and structure preservation
Appearance-rich prompting and restart refinement that enhance appearance control and visual quality
Liheng Zhang, Peking University
Lexi Pang, Peking University
Hang Ye, Peking University
Xiaoxuan Ma, Peking University (Computer Vision, Digital Humans, AI for Science)
Yizhou Wang, Peking University