🤖 AI Summary
Text-to-image diffusion models often suffer from poor semantic alignment between prompts and generated images, and no automated mechanism exists for detecting such semantic failures. To address this, we propose the first intrinsically interpretable diffusion architecture, built on B-cos neural units and a condition-driven feature disentanglement mechanism that enables precise, token-level attribution from prompt words to the corresponding image pixel regions during denoising. By integrating B-cos computation with attention-aware feature decomposition, our method produces clear, verifiable semantic-spatial correspondence maps without post-hoc explanation techniques. While preserving generation fidelity, it enables fine-grained semantic editing and fully automatic semantic-consistency diagnosis, capabilities that were previously unattainable. This advances model transparency, controllability, and trustworthiness, establishing a new paradigm for controllable generation and human-AI collaboration.
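The B-cos computation referenced above has a simple core idea (Böhle et al., "B-cos Networks"): each linear response is scaled by the cosine similarity between input and weight raised to the power B−1, so only well-aligned inputs produce strong outputs, making the layer's decision directly attributable. A minimal NumPy sketch, not the paper's actual layer (`bcos_linear` and its arguments are illustrative names):

```python
import numpy as np

def bcos_linear(x, W, B=2.0, eps=1e-9):
    """Illustrative B-cos unit: scale each normalized linear response
    by |cos(x, w)|^(B-1), rewarding weight-input alignment."""
    W_hat = W / (np.linalg.norm(W, axis=1, keepdims=True) + eps)  # unit-norm rows
    lin = W_hat @ x                                # w_hat . x
    cos = lin / (np.linalg.norm(x) + eps)          # cos(x, w_hat)
    return lin * np.abs(cos) ** (B - 1)            # B-cos response
```

With B = 1 this reduces to an ordinary (weight-normalized) linear layer; larger B sharpens the alignment pressure, which is what makes the contribution maps interpretable.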
📝 Abstract
Text-to-image diffusion models generate images by iteratively denoising random noise, conditioned on a prompt. While these models have enabled impressive progress in image generation, they often fail to accurately reflect all semantic information described in the prompt -- failures that are difficult to detect automatically. In this work, we introduce a diffusion model architecture built with B-cos modules that offers inherent interpretability. Our approach provides insight into how individual prompt tokens affect the generated image by producing explanations that highlight the pixel regions influenced by each token. We demonstrate that B-cos diffusion models can produce high-quality images while providing meaningful insights into prompt-image alignment.
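The iterative denoising the abstract describes can be sketched generically. The following is a minimal DDPM-style reverse process (Ho et al., 2020), not the paper's architecture: `eps_model` is a hypothetical stand-in for the prompt-conditioned noise predictor, and the beta schedule values are illustrative:

```python
import numpy as np

def ddpm_sample(eps_model, shape, T=50, seed=0):
    """Minimal DDPM reverse process: start from Gaussian noise and
    iteratively subtract the predicted noise at each timestep.
    `eps_model(x, t)` stands in for a prompt-conditioned U-Net."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, T)      # illustrative noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)          # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        eps = eps_model(x, t)               # predicted noise at step t
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                           # add fresh noise except at the final step
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x
```

In a B-cos variant of this loop, the noise predictor's internal computations would themselves yield the per-token contribution maps, rather than requiring a separate post-hoc attribution pass.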