🤖 AI Summary
Existing image composition methods suffer from insufficient fidelity under complex lighting conditions (e.g., shadows, water reflections) and with high-resolution inputs, and commonly rely on latent inversion or attention-based intervention, which hinders full exploitation of pretrained diffusion models' physical-plausibility and resolution priors. To address this, we propose SHINE: a fine-tuning-free, inversion-free framework that enables plug-and-play, physically plausible composition with text-to-image diffusion models (e.g., FLUX). SHINE comprises three core components: a manifold-steered anchor loss, degradation-suppression guidance, and adaptive background blending. It integrates seamlessly with customization adapters such as IP-Adapter for high-fidelity object insertion. Evaluated on the newly introduced ComplexCompo benchmark and on DreamEditBench, SHINE achieves state-of-the-art results on DINOv2, DreamSim, ImageReward, and human-preference metrics, significantly improving the realism and compositional consistency of composited images.
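The degradation-suppression guidance named above resembles negative-prompt-style guidance in diffusion sampling: the noise prediction is extrapolated away from a branch conditioned on degradation cues. A minimal NumPy sketch under that assumption; the function name, the weight `w`, and the exact formulation are illustrative, not the paper's implementation:

```python
import numpy as np

def degradation_suppression_guidance(eps_cond, eps_degraded, w=2.0):
    """Steer a diffusion noise prediction away from a 'degraded' branch.

    eps_cond:     prediction conditioned on the desired prompt/subject.
    eps_degraded: prediction conditioned on degradation cues (seams, blur, ...).
    w:            guidance weight; w = 0 recovers the conditional prediction.

    This mirrors classifier-free guidance, but extrapolates away from the
    low-quality direction rather than the unconditional one.
    """
    return eps_cond + w * (eps_cond - eps_degraded)

# Toy example: the guided prediction moves opposite the degraded direction.
eps_c = np.zeros(4)
eps_d = np.ones(4)
guided = degradation_suppression_guidance(eps_c, eps_d, w=1.5)
print(guided)  # [-1.5 -1.5 -1.5 -1.5]
```

With `w = 0` the guidance is a no-op, so the weight trades off quality steering against fidelity to the conditional prediction, exactly as the guidance scale does in classifier-free guidance.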
📝 Abstract
Image composition aims to seamlessly insert a user-specified object into a new scene, but existing models struggle with complex lighting (e.g., accurate shadows, water reflections) and diverse, high-resolution inputs. Modern text-to-image diffusion models (e.g., SD3.5, FLUX) already encode the necessary physical and resolution priors, yet no framework unleashes them without resorting to latent inversion (which often locks object poses into contextually inappropriate orientations) or brittle attention surgery. We propose SHINE, a training-free framework for Seamless, High-fidelity Insertion with Neutralized Errors. SHINE introduces a manifold-steered anchor loss that leverages pretrained customization adapters (e.g., IP-Adapter) to guide latents toward a faithful subject representation while preserving background integrity. Degradation-suppression guidance and adaptive background blending further eliminate low-quality outputs and visible seams. To address the lack of rigorous benchmarks, we introduce ComplexCompo, featuring diverse resolutions and challenging conditions such as low lighting, strong illumination, intricate shadows, and reflective surfaces. Experiments on ComplexCompo and DreamEditBench show state-of-the-art performance on standard metrics (e.g., DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward). Code and the benchmark will be publicly released upon publication.
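The adaptive background blending mentioned in the abstract can be read as mask-feathered compositing: the generated region is merged with the original background through a softened mask so that no hard seam remains. A minimal NumPy sketch; the feathering scheme (repeated box blurs) and all names here are illustrative assumptions, not the paper's method:

```python
import numpy as np

def feather_mask(mask, iters=3):
    """Soften a binary mask with repeated 3x3 box blurs (cheap feathering)."""
    m = mask.astype(np.float64)
    for _ in range(iters):
        padded = np.pad(m, 1, mode="edge")
        # average each pixel with its 8 neighbors
        m = sum(
            padded[i:i + m.shape[0], j:j + m.shape[1]]
            for i in range(3) for j in range(3)
        ) / 9.0
    return m

def blend_background(generated, background, mask, iters=3):
    """Composite: keep the original background outside the feathered mask."""
    soft = feather_mask(mask, iters)
    return soft * generated + (1.0 - soft) * background

# Toy example on a single-channel 8x8 "image".
gen = np.ones((8, 8))          # synthesized content
bg = np.zeros((8, 8))          # original background
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0           # object region
out = blend_background(gen, bg, mask)
```

The feathering keeps the interior of the object region close to the generated content while the far background stays untouched, with a smooth transition in between; the same idea applies whether the blend is done in pixel space or in the diffusion model's latent space.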