🤖 AI Summary
Existing controllable text-to-image diffusion models suffer from structural distortions and semantic inconsistencies in fine-grained control over pose and scene layout. To address this, we propose a training-free dual-recursive feedback mechanism that alternately optimizes appearance and generation latent variables during diffusion sampling, enabling cross-category disentanglement and fusion of structure and appearance—e.g., transferring human poses onto tiger-shaped bodies. Our method integrates free-form conditional injection with joint dual-feedback optimization, substantially improving spatial structural fidelity and semantic consistency. Experiments demonstrate that our approach outperforms mainstream controllable generation models—including ControlNet and T2I-Adapter—on complex pose transfer tasks. Generated images exhibit high visual quality, robust structural integrity, and coherent semantics, validating the effectiveness of latent-space structural-appearance decoupling without additional training.
📝 Abstract
Recent advancements in controllable text-to-image (T2I) diffusion models, such as Ctrl-X and FreeControl, have demonstrated robust spatial and appearance control without requiring auxiliary module training. However, these models often struggle to accurately preserve spatial structures and fail to capture fine-grained conditions related to object poses and scene layouts. To address these challenges, we propose a training-free Dual Recursive Feedback (DRF) system that properly reflects control conditions in controllable T2I models. The proposed DRF consists of appearance feedback and generation feedback that recursively refine the intermediate latents to better reflect the given appearance information and the user's intent. This dual-update mechanism guides latent representations toward reliable manifolds, effectively integrating structural and appearance attributes. Our approach enables fine-grained generation even for class-invariant structure-appearance fusion, such as transferring human motion onto a tiger's form. Extensive experiments demonstrate the efficacy of our method in generating high-quality, semantically coherent, and structurally consistent images. Our source code is available at https://github.com/jwonkm/DRF.
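The abstract describes the DRF loop only at a high level: two feedback terms recursively refine an intermediate latent during sampling. A minimal NumPy sketch of that update pattern is below. Every function, weight, and variable here (`appearance_feedback`, `generation_feedback`, `lam_a`, `lam_g`, `n_recursions`) is an illustrative placeholder, not the paper's actual implementation; the real method operates on diffusion latents with learned score networks.

```python
import numpy as np

# Hypothetical sketch of a dual recursive feedback update on one
# intermediate latent. The feedback terms below are toy energy
# gradients standing in for the paper's appearance and generation
# feedback; names and weights are assumptions, not the official code.

def appearance_feedback(latent, appearance_ref):
    # Placeholder: pull the latent toward the appearance reference.
    return appearance_ref - latent

def generation_feedback(latent, structure_ref):
    # Placeholder: pull the latent toward the structure condition.
    return structure_ref - latent

def drf_step(latent, appearance_ref, structure_ref,
             lam_a=0.1, lam_g=0.1, n_recursions=3):
    # Alternately apply both feedback terms, recursively, before
    # handing the latent back to the next diffusion sampling step.
    for _ in range(n_recursions):
        latent = latent + lam_a * appearance_feedback(latent, appearance_ref)
        latent = latent + lam_g * generation_feedback(latent, structure_ref)
    return latent

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 4))          # toy intermediate latent
app = np.ones((4, 4))                    # toy appearance target
struct = np.zeros((4, 4))                # toy structure target
z_refined = drf_step(z, app, struct)
```

With these placeholder linear feedback terms the update is a contraction toward a point between the two targets, which mirrors the abstract's claim that the dual update guides latents toward a manifold integrating both structural and appearance attributes.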