🤖 AI Summary
Diffusion models for text-to-image (T2I) generation frequently suffer from object fragmentation or incompleteness, undermining downstream utility. We identify RandomCrop, a data augmentation widely adopted during pretraining, as a primary cause of this issue. To address it, we propose a fine-tuning-free boundary activation penalty: during the early denoising steps in Stable Diffusion, we suppress feature activations in the boundary regions of the UNet's intermediate feature maps, encouraging globally coherent and structurally complete object generation. Our method operates solely via inference-time feature modulation, incurring negligible computational overhead. Experiments demonstrate consistent and significant improvements in object completeness and overall image quality across multiple benchmarks. The method generalizes across diverse prompts and scene configurations and requires no architectural modification or retraining, establishing a plug-and-play paradigm for integrity-aware T2I generation.
📝 Abstract
Diffusion models have emerged as a powerful technique for text-to-image (T2I) generation, creating high-quality, diverse images across various domains. However, a common limitation of these models is the incomplete rendering of objects, where fragments or missing parts undermine the model's performance in downstream applications. In this study, we conduct an in-depth analysis of the incompleteness issue and reveal that the primary factor behind incomplete object generation is the use of RandomCrop during model training. This widely used data augmentation method, though it enhances the model's generalization ability, disrupts object continuity during training. To address this, we propose a training-free solution that penalizes activation values at image boundaries during the early denoising steps. Our method is easily applicable to pre-trained Stable Diffusion models with minimal modifications and negligible computational overhead. Extensive experiments demonstrate the effectiveness of our method, showing substantial improvements in object integrity and image quality.
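The core idea of the boundary penalty can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it attenuates the border of a `(C, H, W)` feature map (as would come out of a UNet block) during the first few denoising steps, leaving the interior and all later steps untouched. The function name, the margin width, the step threshold, and the suppression scale are all hypothetical choices for illustration.

```python
import numpy as np

def suppress_boundary_activations(feat, step, early_steps=10, margin=2, scale=0.0):
    """Attenuate a border of width `margin` in a (C, H, W) feature map
    during the first `early_steps` denoising steps.

    All parameter names and default values here are illustrative
    assumptions, not taken from the paper.
    """
    if step >= early_steps:
        return feat  # later steps are left unmodified
    out = feat.copy()
    out[:, :margin, :] *= scale   # top rows
    out[:, -margin:, :] *= scale  # bottom rows
    out[:, :, :margin] *= scale   # left columns
    out[:, :, -margin:] *= scale  # right columns
    return out

# Toy example: a 1-channel 6x6 feature map of ones.
feat = np.ones((1, 6, 6))
mod = suppress_boundary_activations(feat, step=0, margin=1)
print(mod[0, 0, 0], mod[0, 3, 3])  # border suppressed, interior intact
```

In a real Stable Diffusion pipeline this modulation would be applied to intermediate UNet feature maps (e.g. via forward hooks) rather than to a standalone array, with the current timestep deciding whether the penalty is active.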