🤖 AI Summary
Existing image-text alignment benchmarks rely on rule-based perturbations or short captions, and therefore cannot assess fine-grained semantic alignment. Method: AlignBench is a fine-grained evaluation benchmark built on detailed image-caption pairs produced by diverse generative models (both image-to-text and text-to-image), with every sentence human-annotated for correctness. Contribution/Results: Systematically benchmarking vision-language models as alignment evaluators uncovers pervasive deficiencies: CLIP-based models, even those tailored for compositional reasoning, are nearly blind to sentence-level errors; decoder-based VLMs systematically over-score early sentences; and models exhibit strong self-preference, favoring captions from their own generator at the cost of detection performance. The benchmark thus offers a more discriminative, reproducible standard for evaluating fine-grained image-text alignment.
📝 Abstract
Assessing image-text alignment models such as CLIP is crucial for bridging visual and linguistic representations. Yet existing benchmarks rely on rule-based perturbations or short captions, limiting their ability to measure fine-grained alignment. We introduce AlignBench, a benchmark that provides a new indicator of image-text alignment by evaluating detailed image-caption pairs generated by diverse image-to-text and text-to-image models. Each sentence is annotated for correctness, enabling direct assessment of VLMs as alignment evaluators. Benchmarking a wide range of decoder-based VLMs reveals three key findings: (i) CLIP-based models, even those tailored for compositional reasoning, remain nearly blind to sentence-level errors; (ii) decoder-based VLMs used as detectors systematically over-score early sentences; and (iii) they show strong self-preference, favoring their own outputs and harming detection performance. Our project page will be available at https://dahlian00.github.io/AlignBench/.
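The evaluation protocol sketched in the abstract — scoring each caption sentence against the image and comparing predictions to human correctness labels, then checking for positional bias — can be illustrated as follows. This is a hypothetical sketch, not the actual AlignBench code: the data, the `toy_score` stand-in for a real VLM scorer, and the threshold are all illustrative assumptions.

```python
# Hypothetical sketch of sentence-level alignment evaluation
# (illustrative only; not the actual AlignBench implementation).

def detection_accuracy(examples, score_fn, threshold=0.5):
    """Fraction of sentences whose predicted label (score >= threshold
    means 'correct') matches the human correctness annotation."""
    hits = total = 0
    for image_id, sentences in examples:
        for sent, is_correct in sentences:
            pred = score_fn(image_id, sent) >= threshold
            hits += (pred == is_correct)
            total += 1
    return hits / total

def mean_score_by_position(examples, score_fn):
    """Average alignment score per sentence position; systematically
    higher scores at early positions, with errors spread uniformly,
    would indicate the early-sentence over-scoring described above."""
    sums, counts = {}, {}
    for image_id, sentences in examples:
        for pos, (sent, _) in enumerate(sentences):
            sums[pos] = sums.get(pos, 0.0) + score_fn(image_id, sent)
            counts[pos] = counts.get(pos, 0) + 1
    return {pos: sums[pos] / counts[pos] for pos in sums}

# Toy data: (image_id, [(sentence, human_label_is_correct), ...]).
examples = [
    ("img0", [("A dog sits on grass.", True), ("It wears a blue hat.", False)]),
    ("img1", [("Two cups sit on a table.", True), ("A cat sleeps nearby.", True)]),
]

# Stand-in scorer; a real evaluator would query a VLM with the image.
def toy_score(image_id, sentence):
    return 0.9 if any(w in sentence for w in ("dog", "cups", "cat")) else 0.2

print(detection_accuracy(examples, toy_score))     # overall detection accuracy
print(mean_score_by_position(examples, toy_score)) # per-position mean scores
```

Separating detection accuracy from per-position score statistics is what lets a benchmark distinguish genuine alignment failures from the positional and self-preference biases reported above.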