🤖 AI Summary
To address the challenges of artifact localization and interpretable repair in text-to-image diffusion models, this paper proposes a two-stage “diagnose-then-treat” optimization framework. In the first stage, a pixel-level artifact detector is constructed to enable fine-grained, localization-aware defect identification. In the second stage, the detection confidence map is integrated into the diffusion reverse process via gradient modulation and pixel-wise weighted loss to guide precise artifact correction. Our key contributions include: (i) the first introduction of localization-aware diagnostic modeling into diffusion optimization; (ii) construction of a million-scale defective image dataset with a human-in-the-loop annotation protocol. Experiments across multiple mainstream diffusion models show an average 42.7% reduction in artifact rate, a 3.2 improvement in FID, and an mAP@0.5 of 68.9—demonstrating both strong visual interpretability and restoration efficacy.
📝 Abstract
In spite of the recent progress, image diffusion models still produce artifacts. A common solution is to refine an established model with a quality assessment system, which generally rates an image in its entirety. In this work, we believe problem-solving starts with identification, yielding the request that the model should be aware of not just the presence of defects in an image, but their specific locations. Motivated by this, we propose DiffDoctor, a two-stage pipeline to assist image diffusion models in generating fewer artifacts. Concretely, the first stage targets developing a robust artifact detector, for which we collect a dataset of over 1M flawed synthesized images and set up an efficient human-in-the-loop annotation process, incorporating a carefully designed class-balance strategy. The learned artifact detector is then involved in the second stage to tune the diffusion model through assigning a per-pixel confidence map for each synthesis. Extensive experiments on text-to-image diffusion models demonstrate the effectiveness of our artifact detector as well as the soundness of our diagnose-then-treat design.