🤖 AI Summary
Existing image composition methods suffer from insufficient fidelity under complex lighting conditions (e.g., shadows, water reflections) and with high-resolution inputs, and commonly rely on latent inversion or attention-based intervention, which hinders full exploitation of pretrained diffusion models' physical-plausibility and resolution priors. To address this, we propose SHINE: a fine-tuning-free, inversion-free framework that enables plug-and-play, physically plausible composition with text-to-image diffusion models (e.g., FLUX). SHINE comprises three core components: a manifold-steered anchor loss, degradation-suppression guidance, and adaptive background blending. It integrates seamlessly with customization adapters such as IP-Adapter for high-fidelity object insertion. Evaluated on the newly introduced ComplexCompo benchmark and on DreamEditBench, SHINE achieves state-of-the-art results on DINOv2, DreamSim, ImageReward, and human-preference metrics, significantly improving the realism and compositional consistency of composited images.
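The degradation-suppression guidance named above resembles negative-prompt-style guidance in diffusion sampling: the noise prediction is extrapolated away from a branch conditioned on degradation cues. A minimal NumPy sketch under that assumption; the function name, the weight `w`, and the exact formulation are illustrative, not the paper's implementation:

```python
import numpy as np

def degradation_suppression_guidance(eps_cond, eps_degraded, w=2.0):
    """Steer a diffusion noise prediction away from a 'degraded' branch.

    eps_cond:     prediction conditioned on the desired prompt/subject.
    eps_degraded: prediction conditioned on degradation cues (seams, blur, ...).
    w:            guidance weight; w = 0 recovers the conditional prediction.

    This mirrors classifier-free guidance, but extrapolates away from the
    low-quality direction rather than the unconditional one.
    """
    return eps_cond + w * (eps_cond - eps_degraded)

# Toy example: the guided prediction moves opposite the degraded direction.
eps_c = np.zeros(4)
eps_d = np.ones(4)
guided = degradation_suppression_guidance(eps_c, eps_d, w=1.5)
print(guided)  # [-1.5 -1.5 -1.5 -1.5]
```

With `w = 0` the guidance is a no-op, so the weight trades off quality steering against fidelity to the conditional prediction, exactly as the guidance scale does in classifier-free guidance.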
📝 Abstract
Image composition aims to seamlessly insert a user-specified object into a new scene, but existing models struggle with complex lighting (e.g., accurate shadows, water reflections) and diverse, high-resolution inputs. Modern text-to-image diffusion models (e.g., SD3.5, FLUX) already encode the necessary physical and resolution priors, yet no framework unleashes them without resorting to latent inversion (which often locks object poses into contextually inappropriate orientations) or brittle attention surgery. We propose SHINE, a training-free framework for Seamless, High-fidelity Insertion with Neutralized Errors. SHINE introduces a manifold-steered anchor loss that leverages pretrained customization adapters (e.g., IP-Adapter) to guide latents toward a faithful subject representation while preserving background integrity. Degradation-suppression guidance and adaptive background blending further eliminate low-quality outputs and visible seams. To address the lack of rigorous benchmarks, we introduce ComplexCompo, featuring diverse resolutions and challenging conditions such as low lighting, strong illumination, intricate shadows, and reflective surfaces. Experiments on ComplexCompo and DreamEditBench show state-of-the-art performance on standard metrics (e.g., DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward). Code and the benchmark will be publicly released upon publication.
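The adaptive background blending mentioned in the abstract can be read as mask-feathered compositing: the generated region is merged with the original background through a softened mask so that no hard seam remains. A minimal NumPy sketch; the feathering scheme (repeated box blurs) and all names here are illustrative assumptions, not the paper's method:

```python
import numpy as np

def feather_mask(mask, iters=3):
    """Soften a binary mask with repeated 3x3 box blurs (cheap feathering)."""
    m = mask.astype(np.float64)
    for _ in range(iters):
        padded = np.pad(m, 1, mode="edge")
        # average each pixel with its 8 neighbors
        m = sum(
            padded[i:i + m.shape[0], j:j + m.shape[1]]
            for i in range(3) for j in range(3)
        ) / 9.0
    return m

def blend_background(generated, background, mask, iters=3):
    """Composite: keep the original background outside the feathered mask."""
    soft = feather_mask(mask, iters)
    return soft * generated + (1.0 - soft) * background

# Toy example on a single-channel 8x8 "image".
gen = np.ones((8, 8))          # synthesized content
bg = np.zeros((8, 8))          # original background
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0           # object region
out = blend_background(gen, bg, mask)
```

The feathering keeps the interior of the object region close to the generated content while the far background stays untouched, with a smooth transition in between; the same idea applies whether the blend is done in pixel space or in the diffusion model's latent space.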