🤖 AI Summary
Existing image-text alignment benchmarks rely on rule-based perturbations or short captions, and therefore cannot assess fine-grained semantic alignment. Method: AlignBench is a fine-grained evaluation benchmark built on detailed image-caption pairs produced by diverse generative models (both image-to-text and text-to-image), with every sentence human-annotated for correctness. Contribution/Results: Systematically benchmarking vision-language models as alignment evaluators uncovers pervasive deficiencies: CLIP-based models, even those tailored for compositional reasoning, are nearly blind to sentence-level errors; decoder-based VLMs systematically over-score early sentences; and models exhibit strong self-preference, favoring captions from their own generator at the cost of detection performance. The benchmark thus offers a more discriminative, reproducible standard for evaluating fine-grained image-text alignment.
📝 Abstract
Assessing image-text alignment models such as CLIP is crucial for bridging visual and linguistic representations. Yet existing benchmarks rely on rule-based perturbations or short captions, limiting their ability to measure fine-grained alignment. We introduce AlignBench, a benchmark that provides a new indicator of image-text alignment by evaluating detailed image-caption pairs generated by diverse image-to-text and text-to-image models. Each sentence is annotated for correctness, enabling direct assessment of VLMs as alignment evaluators. Benchmarking a wide range of decoder-based VLMs reveals three key findings: (i) CLIP-based models, even those tailored for compositional reasoning, remain nearly blind to sentence-level errors; (ii) decoder-based VLMs used as detectors systematically over-score early sentences; and (iii) they show strong self-preference, favoring their own outputs and harming detection performance. Our project page will be available at https://dahlian00.github.io/AlignBench/.
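The evaluation protocol sketched in the abstract — scoring each caption sentence against the image and comparing predictions to human correctness labels, then checking for positional bias — can be illustrated as follows. This is a hypothetical sketch, not the actual AlignBench code: the data, the `toy_score` stand-in for a real VLM scorer, and the threshold are all illustrative assumptions.

```python
# Hypothetical sketch of sentence-level alignment evaluation
# (illustrative only; not the actual AlignBench implementation).

def detection_accuracy(examples, score_fn, threshold=0.5):
    """Fraction of sentences whose predicted label (score >= threshold
    means 'correct') matches the human correctness annotation."""
    hits = total = 0
    for image_id, sentences in examples:
        for sent, is_correct in sentences:
            pred = score_fn(image_id, sent) >= threshold
            hits += (pred == is_correct)
            total += 1
    return hits / total

def mean_score_by_position(examples, score_fn):
    """Average alignment score per sentence position; systematically
    higher scores at early positions, with errors spread uniformly,
    would indicate the early-sentence over-scoring described above."""
    sums, counts = {}, {}
    for image_id, sentences in examples:
        for pos, (sent, _) in enumerate(sentences):
            sums[pos] = sums.get(pos, 0.0) + score_fn(image_id, sent)
            counts[pos] = counts.get(pos, 0) + 1
    return {pos: sums[pos] / counts[pos] for pos in sums}

# Toy data: (image_id, [(sentence, human_label_is_correct), ...]).
examples = [
    ("img0", [("A dog sits on grass.", True), ("It wears a blue hat.", False)]),
    ("img1", [("Two cups sit on a table.", True), ("A cat sleeps nearby.", True)]),
]

# Stand-in scorer; a real evaluator would query a VLM with the image.
def toy_score(image_id, sentence):
    return 0.9 if any(w in sentence for w in ("dog", "cups", "cat")) else 0.2

print(detection_accuracy(examples, toy_score))     # overall detection accuracy
print(mean_score_by_position(examples, toy_score)) # per-position mean scores
```

Separating detection accuracy from per-position score statistics is what lets a benchmark distinguish genuine alignment failures from the positional and self-preference biases reported above.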