Training-Free Object-Background Compositional T2I via Dynamic Spatial Guidance and Multi-Path Pruning

📅 2026-04-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current text-to-image diffusion models commonly suffer from foreground bias: backgrounds are degraded and scenes lack overall coherence, which hinders controllable composition of objects and backgrounds. This work proposes a training-free sampling framework that dynamically guides spatial generation through a timestep-dependent soft gating mechanism. It also performs multi-path latent trajectory pruning, filtering candidate trajectories with internal attention statistics and external semantic signals so that foreground-background interactions are modeled explicitly. Without any additional training, the approach improves spatial balance and semantic consistency, with consistent gains in background coherence and object-background alignment across multiple diffusion backbones. A dedicated benchmark for object-background compositionality is also introduced and used to evaluate the method.
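As a rough illustration of what a timestep-dependent soft gate could look like, the sketch below uses a sigmoid schedule to shift emphasis between background and foreground positions over the sampling steps. The summary does not give the gate's actual form, so the function names, the sigmoid shape, the direction of the schedule (background emphasized early, foreground late), and the `midpoint`, `sharpness`, and `strength` parameters are all assumptions made for illustration.

```python
import math
from typing import List


def soft_gate(t: int, num_steps: int, midpoint: float = 0.5,
              sharpness: float = 10.0) -> float:
    """Hypothetical gate in (0, 1): close to 1 early in sampling, close to 0 late.

    The sigmoid form is an assumption, not the paper's stated schedule.
    """
    progress = t / max(num_steps - 1, 1)  # 0.0 at the first step, 1.0 at the last
    return 1.0 / (1.0 + math.exp(sharpness * (progress - midpoint)))


def gated_attention(attn: List[float], fg_mask: List[float], gate: float,
                    strength: float = 0.5) -> List[float]:
    """Softly reweight per-position attention and renormalize to sum to 1.

    attn    : attention weights over spatial positions
    fg_mask : soft foreground mask in [0, 1] (1 = foreground, 0 = background)
    gate    : value from soft_gate; high values boost background positions,
              low values boost foreground positions (assumed convention)
    """
    scaled = []
    for a, m in zip(attn, fg_mask):
        boost = gate * (1.0 - m) + (1.0 - gate) * m
        scaled.append(a * (1.0 + strength * boost))
    total = sum(scaled)
    return [s / total for s in scaled]


if __name__ == "__main__":
    attn = [0.25, 0.25, 0.25, 0.25]
    mask = [1.0, 1.0, 0.0, 0.0]  # first two positions are foreground
    for t in (0, 25, 49):
        g = soft_gate(t, num_steps=50)
        print(t, round(g, 3), [round(x, 3) for x in gated_attention(attn, mask, g)])
```

Because the gate is soft, no position is ever masked out entirely; the reweighting only tilts attention toward one region and the weights still sum to one.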

📝 Abstract

Existing text-to-image diffusion models, while excelling at subject synthesis, exhibit a persistent foreground bias that treats the background as a passive and under-optimized byproduct. This imbalance compromises global scene coherence and constrains compositional control. To address this limitation, we propose a training-free framework that restructures diffusion sampling to explicitly account for foreground-background interactions. Our approach consists of two key components. First, Dynamic Spatial Guidance introduces a soft, timestep-dependent gating mechanism that modulates foreground and background attention during the diffusion process, enabling spatially balanced generation. Second, Multi-Path Pruning performs multi-path latent exploration and dynamically filters candidate trajectories using both internal attention statistics and external semantic alignment signals, retaining trajectories that better satisfy object-background constraints. We further develop a benchmark specifically designed to evaluate object-background compositionality. Extensive evaluations across multiple diffusion backbones demonstrate consistent improvements in background coherence and object-background compositional alignment.
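The Multi-Path Pruning component described above can be pictured as a branch-and-filter loop over candidate latents. The sketch below is a minimal, generic version under stated assumptions: the abstract does not specify the branching scheme, the scoring functions, or how the internal and external signals are combined, so `combined_score`, `explore_and_prune`, the linear mixing weight `alpha`, and the `branch` and `keep` parameters are hypothetical names chosen for illustration. The `step` and `score` callables stand in for a denoising update and a trajectory score respectively.

```python
from typing import Callable, List, TypeVar

Latent = TypeVar("Latent")


def combined_score(attn_stat: float, semantic_align: float,
                   alpha: float = 0.5) -> float:
    """Hypothetical linear mix of an internal attention statistic and an
    external semantic alignment signal (the mixing rule is an assumption)."""
    return alpha * attn_stat + (1.0 - alpha) * semantic_align


def explore_and_prune(init: Latent,
                      step: Callable[[Latent, int, int], Latent],
                      score: Callable[[Latent, int], float],
                      num_steps: int,
                      branch: int = 2,
                      keep: int = 2) -> Latent:
    """Expand each surviving path into `branch` candidates per step,
    then keep only the `keep` highest-scoring candidates."""
    paths: List[Latent] = [init]
    for t in range(num_steps):
        # Multi-path exploration: every surviving latent spawns several branches.
        expanded = [step(p, t, b) for p in paths for b in range(branch)]
        # Pruning: rank candidates by score and drop the rest.
        expanded.sort(key=lambda p: score(p, t), reverse=True)
        paths = expanded[:max(1, keep)]
    return paths[0]


if __name__ == "__main__":
    # Toy stand-ins: a scalar "latent" that branch 1 pushes up and branch 0 pushes down.
    toy_step = lambda z, t, b: z + (1.0 if b == 1 else -1.0)
    toy_score = lambda z, t: combined_score(attn_stat=z, semantic_align=z)
    print(explore_and_prune(0.0, toy_step, toy_score, num_steps=3))
```

In a real sampler the cost grows with `branch * keep` denoising calls per step, so these two parameters trade compute for how many alternative trajectories are considered.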

Problem

Research questions and friction points this paper is trying to address.

foreground bias

background coherence

object-background compositionality

text-to-image generation

scene coherence

Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-Free

Object-Background Composition

Dynamic Spatial Guidance