MaSC: A Masked Similarity Metric for Evaluating Concept-Driven Generation

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing evaluation metrics for text-to-image generation process the image globally, so they cannot separate the foreground subject from the background and give inaccurate assessments of concept preservation and prompt following. This work proposes MaSC, a spatially decomposed metric that uses externally provided foreground masks to split evaluation into subject-specific concept preservation and background-based prompt following. Built on a frozen SigLIP2 SO400M-NaFlex model, MaSC measures concept preservation by masked max-cosine matching between foreground reference patches and generated-image patches, and measures prompt following by comparing a background-only pooled image embedding with the embedding of the subject-stripped prompt. MaSC achieves a Krippendorff's α of 0.471 for concept preservation on DreamBench++ human ratings, outperforming all tested non-LLM baselines and GPT-4V and approaching GPT-4o, and an AUC of 0.992 on ORIDa for distinguishing same-subject from cross-subject pairs. Its prompt-following score also outperforms the CLIP-T baseline shipped with DreamBench++.

📝 Abstract

Evaluating single-concept personalization in text-to-image diffusion requires measuring both concept preservation, which captures identity fidelity to a reference, and prompt following, which captures whether the generated scene matches the prompt. Existing metrics commonly compute these signals using global image or text-image embeddings, such as CLIP-I, DINO, and CLIP-T. We show that such metrics correlate poorly with human perception because they attend to the image as a whole instead of separating the concept subject from the background. We introduce MaSC, a masked similarity metric that uses externally provided foreground concept masks to decompose evaluation into subject-specific concept preservation and background-based prompt following. MaSC computes both scores from frozen SigLIP2 SO400M-NaFlex features: concept preservation is measured by masked max-cosine matching between foreground reference patches and generated-image patches, while prompt following is measured by comparing a background-only pooled image embedding to a subject-stripped prompt embedding. On DreamBench++ human ratings, MaSC achieves Krippendorff alpha = 0.471 for concept preservation, outperforming all tested non-LLM baselines and GPT-4V, and approaching GPT-4o. On ORIDa, a real-photo identity-preservation benchmark across physical environments, MaSC achieves AUC = 0.992, nearly perfectly distinguishing same-subject from cross-subject pairs. Its prompt-following score also outperforms the CLIP-T baseline shipped with DreamBench++. These results show that spatially decomposed aggregation is a strong design principle for evaluating concept-driven generation.
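
The two scores described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that patch features and the prompt embedding have already been extracted and live in a comparable space, that each foreground reference patch is matched to its best generated patch and the maxima are averaged, and that the background embedding is a plain mean over background patches. The function names `concept_preservation` and `prompt_following` are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length so dot products act as cosines."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def concept_preservation(ref_patches, ref_mask, gen_patches):
    """Masked max-cosine matching (assumed aggregation: mean of per-patch maxima).

    ref_patches: (N_ref, D) patch features of the reference image
    ref_mask:    (N_ref,) boolean, True where a patch lies on the subject
    gen_patches: (N_gen, D) patch features of the generated image
    """
    if not ref_mask.any():
        raise ValueError("reference mask selects no foreground patches")
    fg = l2_normalize(ref_patches[ref_mask])   # keep only subject patches
    gen = l2_normalize(gen_patches)
    sims = fg @ gen.T                          # (N_fg, N_gen) cosine matrix
    return float(sims.max(axis=1).mean())      # best match per subject patch


def prompt_following(gen_patches, gen_mask, stripped_prompt_embedding):
    """Cosine between a background-only pooled image embedding and the
    embedding of the prompt with the subject removed (assumed mean pooling).

    gen_mask: (N_gen,) boolean, True where a generated patch lies on the subject
    """
    if gen_mask.all():
        raise ValueError("generated mask leaves no background patches")
    background = gen_patches[~gen_mask]        # drop subject patches
    pooled = l2_normalize(background.mean(axis=0))
    text = l2_normalize(stripped_prompt_embedding)
    return float(pooled @ text)


# Toy usage with random stand-in features (hypothetical shapes).
rng = np.random.default_rng(0)
ref = rng.normal(size=(16, 8))
ref_mask = np.zeros(16, dtype=bool)
ref_mask[:4] = True
gen = rng.normal(size=(20, 8))
gen_mask = np.zeros(20, dtype=bool)
gen_mask[:5] = True
prompt_vec = rng.normal(size=8)

cp_score = concept_preservation(ref, ref_mask, gen)
pf_score = prompt_following(gen, gen_mask, prompt_vec)
```

Restricting the reference side to masked patches keeps background clutter in the reference from inflating or deflating the identity score, while dropping subject patches before pooling keeps the subject from dominating the comparison with the prompt. In practice the features would come from the frozen SigLIP2 encoder named in the abstract rather than from random arrays.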

Problem

Research questions and friction points this paper is trying to address.

concept preservation

prompt following

masked similarity

text-to-image generation

evaluation metric

Innovation

Methods, ideas, or system contributions that make the work stand out.

masked similarity

concept preservation

prompt following