Dual Recursive Feedback on Generation and Appearance Latents for Pose-Robust Text-to-Image Diffusion

📅 2025-08-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing controllable text-to-image diffusion models suffer from structural distortions and semantic inconsistencies when exerting fine-grained control over pose and scene layout. To address this, we propose a training-free dual recursive feedback mechanism that alternately optimizes appearance and generation latents during diffusion sampling, enabling cross-category disentanglement and fusion of structure and appearance, e.g., transferring human poses onto tiger-shaped bodies. The method combines free-form condition injection with joint dual-feedback optimization, substantially improving spatial-structural fidelity and semantic consistency. Experiments show that it outperforms mainstream controllable generation models, including ControlNet and T2I-Adapter, on complex pose-transfer tasks: the generated images exhibit high visual quality, robust structural integrity, and coherent semantics, validating latent-space structure-appearance decoupling without additional training.

📝 Abstract
Recent advancements in controllable text-to-image (T2I) diffusion models, such as Ctrl-X and FreeControl, have demonstrated robust spatial and appearance control without requiring auxiliary module training. However, these models often struggle to accurately preserve spatial structures and fail to capture fine-grained conditions related to object poses and scene layouts. To address these challenges, we propose a training-free Dual Recursive Feedback (DRF) system that properly reflects control conditions in controllable T2I models. The proposed DRF consists of appearance feedback and generation feedback that recursively refine the intermediate latents to better reflect the given appearance information and the user's intent. This dual-update mechanism guides latent representations toward reliable manifolds, effectively integrating structural and appearance attributes. Our approach enables fine-grained generation even in class-invariant structure-appearance fusion, such as transferring human motion onto a tiger's form. Extensive experiments demonstrate the efficacy of our method in producing high-quality, semantically coherent, and structurally consistent images. Our source code is available at https://github.com/jwonkm/DRF.
Problem

Research questions and friction points this paper is trying to address.

Accurately preserving spatial structures in controllable text-to-image models
Capturing fine-grained conditions for object poses and scene layouts
Integrating structural and appearance attributes without training requirements
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual Recursive Feedback system for refinement
Training-free approach for controllable diffusion models
Recursive latent update for structure-appearance fusion
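The recursive latent update described above can be sketched in simplified form: at each sampling step, an appearance feedback and a generation (structure) feedback alternately nudge the intermediate latent toward their respective conditions. This is a minimal illustrative sketch only; the function names, scalar latent, and simple interpolation updates are stand-ins, not the paper's actual losses or diffusion sampler.

```python
# Illustrative sketch of a dual recursive feedback loop.
# All names and update rules here are hypothetical simplifications;
# DRF operates on diffusion latents with model-derived feedback signals.

def appearance_feedback(latent, appearance_target, rate=0.1):
    # Nudge the latent toward the appearance reference.
    return latent + rate * (appearance_target - latent)

def generation_feedback(latent, structure_target, rate=0.1):
    # Nudge the latent toward the structure/pose condition.
    return latent + rate * (structure_target - latent)

def drf_sampling(latent, appearance_target, structure_target, steps=50):
    """Alternate the two feedback updates at every denoising step."""
    for _ in range(steps):
        latent = appearance_feedback(latent, appearance_target)
        latent = generation_feedback(latent, structure_target)
        # A real sampler would apply the diffusion denoising update here.
    return latent

result = drf_sampling(latent=0.0, appearance_target=1.0, structure_target=0.5)
```

Alternating the two updates lets neither condition dominate: the latent settles between the appearance and structure targets, which mirrors the paper's claim of fusing both attribute sets in latent space.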