🤖 AI Summary
This work addresses the frequent omission of target concepts in text-to-image generation by multimodal diffusion transformers, which often arises from insufficient activation of the relevant semantic representations. The study presents the first identification of a linearly detectable “omission signal” within text embeddings and introduces Omission Signal Intervention (OSI), a method that amplifies this signal to promote the generation of missing concepts without requiring model fine-tuning. Built on linear probing and text-embedding analysis, OSI is compatible with prevailing multimodal diffusion transformer architectures and substantially mitigates concept omission on both FLUX.1-Dev and SD3.5-Medium. Notably, the approach remains robust even under extremely complex prompts.
📝 Abstract
Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-to-image generation, yet they frequently suffer from concept omission, where specified objects or attributes fail to emerge in the generated image. By performing linear probing on text tokens, we demonstrate that text embeddings can distinguish a characteristic “omission signal” representing the absence of target concepts. Leveraging this insight, we propose Omission Signal Intervention (OSI), which amplifies the omission signal to actively catalyze the generation of missing concepts. Comprehensive experiments on FLUX.1-Dev and SD3.5-Medium demonstrate that OSI significantly alleviates concept omission even in extreme scenarios.
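The two ideas named in the abstract, a linear probe on text-token embeddings and amplification of the direction it finds, can be illustrated with a small toy sketch. This is not the authors' implementation: the embeddings below are synthetic Gaussian vectors rather than MM-DiT text tokens, and the function names (`fit_linear_probe`, `amplify_along_direction`), the strength parameter `alpha`, and the choice to shift along the normalized probe weights are all assumptions made for illustration. The abstract does not specify where in the model or how the intervention is applied.

```python
import numpy as np


def fit_linear_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent.

    X: (n, d) embeddings; y: (n,) labels, 1.0 = concept omitted, 0.0 = present.
    Returns the weight vector w and bias b.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(omitted)
        grad = p - y                             # dLoss/dlogit for cross-entropy
        w -= lr * (X.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def amplify_along_direction(emb, w, alpha):
    """Shift embeddings by alpha along the unit-normalized probe direction.

    Hypothetical stand-in for an intervention step; the paper's actual
    procedure may differ.
    """
    direction = w / np.linalg.norm(w)
    return emb + alpha * direction


# Synthetic stand-in data (assumption): "omitted" samples are offset along a
# hidden direction, so a linear probe can separate the two classes.
rng = np.random.default_rng(0)
n, d = 200, 16
hidden_dir = rng.normal(size=d)
hidden_dir /= np.linalg.norm(hidden_dir)
y = rng.integers(0, 2, size=n).astype(float)
X = rng.normal(size=(n, d)) + 2.0 * y[:, None] * hidden_dir

w, b = fit_linear_probe(X, y)
acc = (((X @ w + b) > 0) == (y == 1)).mean()     # training accuracy of the probe
steered = amplify_along_direction(X, w, alpha=3.0)

print(f"probe accuracy: {acc:.2f}")
print(f"mean probe score before: {(X @ w + b).mean():.2f}")
print(f"mean probe score after:  {(steered @ w + b).mean():.2f}")
```

In this toy setting the probe score rises after the shift by construction, since the embeddings are moved along the probe's own weight direction. Whether such a shift changes what a real model generates is the empirical question the paper addresses on FLUX.1-Dev and SD3.5-Medium.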