🤖 AI Summary
This work addresses the challenge of quantifying and localizing visual inconsistencies in subject-driven generation with diffusion models. We propose a visual-semantic disentanglement framework that uses contrastive learning to decouple visual and semantic features in a pre-trained diffusion backbone, enabling fine-grained visual-semantic correspondence. Building on this, we introduce the Visual-Semantic Matching (VSM) metric, the first to support both spatial localization and quantitative assessment of visual inconsistencies in generated images. Unlike existing methods that rely on global features from CLIP or DINO, VSM is trained in a supervised manner on an automatically annotated image-pair dataset, removing the dependence on external vision-language models. Experiments show that VSM significantly outperforms state-of-the-art multimodal and self-supervised metrics in both detection accuracy and anomaly localization fidelity. Our approach establishes an interpretable, spatially grounded paradigm for evaluating the output quality of generative models.
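The contrastive objective described above is not spelled out in this summary; as a rough illustration, a generic InfoNCE-style loss can pull a feature toward its annotated match and away from mismatched features. This is a hedged sketch under that assumption, not the paper's actual architecture, and all names here are hypothetical.

```python
# Hedged sketch of a generic InfoNCE-style contrastive loss, NOT the
# paper's actual objective: it pulls an anchor feature toward its
# matched ("positive") feature and away from mismatched ("negative")
# ones. All function and variable names are illustrative assumptions.
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """anchor, positive: (C,) vectors; negatives: (N, C). Returns a scalar loss."""
    def cos(a, b):
        # Cosine similarity with a small epsilon for numerical safety.
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    # Logit 0 is the matched pair; the rest are mismatched pairs.
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temperature
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0] + 1e-12)             # cross-entropy on class 0

# A matched pair yields a much lower loss than a mismatched one.
a = np.array([1.0, 0.0])
low = info_nce(a, np.array([1.0, 0.0]), np.array([[0.0, 1.0]]))
high = info_nce(a, np.array([0.0, 1.0]), np.array([[1.0, 0.0]]))
```

The temperature scales how sharply the loss penalizes near-misses; small values make the objective focus on the hardest negatives.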
📝 Abstract
We propose a novel approach for disentangling visual and semantic features from the backbones of pre-trained diffusion models, enabling visual correspondence in a manner analogous to the well-established semantic correspondence. While diffusion model backbones are known to encode semantically rich features, they must also contain visual features to support their image synthesis capabilities. However, isolating these visual features is challenging due to the absence of annotated datasets. To address this, we introduce an automated pipeline that constructs image pairs with annotated semantic and visual correspondences based on existing subject-driven image generation datasets, and design a contrastive architecture to separate the two feature types. Leveraging the disentangled representations, we propose a new metric, Visual Semantic Matching (VSM), that quantifies visual inconsistencies in subject-driven image generation. Empirical results show that our approach outperforms global feature-based metrics such as CLIP, DINO, and vision-language models in quantifying visual inconsistencies while also enabling spatial localization of inconsistent regions. To our knowledge, this is the first method that supports both quantification and localization of inconsistencies in subject-driven generation, offering a valuable tool for advancing this task. Project Page: https://abdo-eldesokey.github.io/mind-the-glitch/
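The abstract does not define how VSM aggregates patch evidence; the following is a minimal, hedged sketch of one plausible shape of such a metric, assuming disentangled visual feature maps for corresponding patches are already extracted, with per-patch cosine similarity giving both a global score and a localization heatmap. The function name and feature shapes are assumptions, not the paper's implementation.

```python
# Illustrative sketch only, NOT the paper's VSM implementation: given two
# already-extracted (H, W, C) visual feature maps with patches assumed to
# be in correspondence, compute a global consistency score and a per-patch
# inconsistency heatmap via cosine similarity.
import numpy as np

def vsm_style_score(ref_feats: np.ndarray, gen_feats: np.ndarray):
    """Return (global_score, heatmap).

    ref_feats, gen_feats: (H, W, C) visual feature maps for corresponding
    patches of the reference and generated images (hypothetical inputs).
    """
    # L2-normalize each patch feature to unit length.
    ref = ref_feats / (np.linalg.norm(ref_feats, axis=-1, keepdims=True) + 1e-8)
    gen = gen_feats / (np.linalg.norm(gen_feats, axis=-1, keepdims=True) + 1e-8)
    # Patch-wise cosine similarity; low values flag visual inconsistency.
    sim = (ref * gen).sum(axis=-1)   # (H, W)
    heatmap = 1.0 - sim              # per-patch inconsistency for localization
    return float(sim.mean()), heatmap

# Identical feature maps should yield a near-perfect consistency score
# and a near-zero heatmap everywhere.
f = np.random.rand(8, 8, 16)
score, hm = vsm_style_score(f, f)
```

The heatmap is what distinguishes this style of metric from global CLIP- or DINO-based scores: a single number says *whether* the subject drifted, while the per-patch map says *where*.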