Does FLUX Already Know How to Perform Physically Plausible Image Composition?

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing image composition methods suffer from insufficient fidelity under complex lighting conditions (e.g., shadows, water reflections) and with high-resolution inputs, and commonly rely on latent inversion or attention-based intervention, which hinders full exploitation of pretrained diffusion models' physical-plausibility and resolution priors. To address this, we propose SHINE: a fine-tuning-free, inversion-free framework that enables plug-and-play, physically plausible composition with text-to-image diffusion models (e.g., FLUX). SHINE comprises three core components: a manifold-steered anchor loss, degradation-suppression guidance, and adaptive background blending. It integrates seamlessly with customization adapters such as IP-Adapter for high-fidelity object insertion. Evaluated on the newly introduced ComplexCompo benchmark and on the existing DreamEditBench, SHINE achieves state-of-the-art performance on DINOv2, DreamSim, ImageReward, and human-preference metrics, significantly improving the realism and compositional consistency of synthesized images.

📝 Abstract
Image composition aims to seamlessly insert a user-specified object into a new scene, but existing models struggle with complex lighting (e.g., accurate shadows, water reflections) and diverse, high-resolution inputs. Modern text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential physical and resolution priors, yet lack a framework to unleash them without resorting to latent inversion, which often locks object poses into contextually inappropriate orientations, or brittle attention surgery. We propose SHINE, a training-free framework for Seamless, High-fidelity Insertion with Neutralized Errors. SHINE introduces manifold-steered anchor loss, leveraging pretrained customization adapters (e.g., IP-Adapter) to guide latents for faithful subject representation while preserving background integrity. Degradation-suppression guidance and adaptive background blending are proposed to further eliminate low-quality outputs and visible seams. To address the lack of rigorous benchmarks, we introduce ComplexCompo, featuring diverse resolutions and challenging conditions such as low lighting, strong illumination, intricate shadows, and reflective surfaces. Experiments on ComplexCompo and DreamEditBench show state-of-the-art performance on standard metrics (e.g., DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward). Code and benchmark will be publicly available upon publication.
Problem

Research questions and friction points this paper is trying to address.

Achieving physically plausible image composition with complex lighting effects
Overcoming limitations of existing models with high-resolution diverse inputs
Enabling faithful object insertion without compromising background integrity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Manifold-steered anchor loss using pretrained customization adapters
Degradation-suppression guidance to eliminate low-quality outputs
Adaptive background blending to remove visible seams
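The page does not give formulas for these components. As a rough illustration only, degradation-suppression guidance can be read by analogy with negative-prompt classifier-free guidance (steering the denoiser away from a "degraded" prediction), and adaptive background blending as mask-weighted latent compositing. The sketch below reflects those assumptions, not the paper's actual method; all function names and parameters are hypothetical.

```python
import numpy as np

def degradation_suppression_guidance(eps_cond, eps_uncond, eps_degraded,
                                     scale=7.5, suppress=2.0):
    """CFG-style combination of noise predictions (hypothetical reading).

    eps_cond:     denoiser output under the positive prompt
    eps_uncond:   denoiser output under the empty prompt
    eps_degraded: denoiser output under a degradation prompt
                  (e.g. "blurry, low quality")
    The last term pushes the update away from the degraded direction.
    """
    return (eps_uncond
            + scale * (eps_cond - eps_uncond)
            - suppress * (eps_degraded - eps_uncond))

def background_blend(z_edit, z_bg, mask):
    """Mask-weighted latent compositing: keep the edited latent inside
    the insertion mask and the original background latent outside it."""
    m = np.clip(mask, 0.0, 1.0)
    return m * z_edit + (1.0 - m) * z_bg
```

With `suppress=0` the first function reduces to standard classifier-free guidance, which is why this reading is plausible; the "adaptive" part of SHINE's blending presumably goes beyond the fixed mask shown here.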
👥 Authors
Shilin Lu, Nanyang Technological University
Zhuming Lian, Nanyang Technological University
Zihan Zhou, Nanyang Technological University
Shaocong Zhang, Nanyang Technological University
Chen Zhao, Nanjing University
Adams Wai-Kin Kong, Nanyang Technological University