SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation

📅 2025-09-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing metrics for evaluating long image captions have critical limitations: n-gram-based metrics ignore semantic meaning; representational similarity (RS) metrics correlate poorly with human judgments; and large language model (LLM)-based metrics are too computationally expensive for iterative use. To address these issues, the paper proposes SPECS, a reference-free and efficient evaluation metric. SPECS modifies CLIP with a new training objective that models specificity, explicitly rewarding correct visual details and penalizing hallucinated or erroneous ones, so that image-text alignment better reflects caption accuracy. Because it retains CLIP's lightweight architecture, SPECS achieves correlation with human judgments on par with open-source LLM-based metrics while reducing computational overhead by over an order of magnitude. The experiments indicate that it is well suited to frequent, iterative checkpoint evaluation during captioning model training, balancing efficiency against discriminative power.

📝 Abstract

As interest grows in generating long, detailed image captions, standard evaluation metrics become increasingly unreliable. N-gram-based metrics, though efficient, fail to capture semantic correctness. Representational Similarity (RS) metrics, designed to address this, initially saw limited use due to high computational costs; today, despite advances in hardware, they remain unpopular due to low correlation with human judgments. Meanwhile, metrics based on large language models (LLMs) show strong correlation with human judgments, but remain too expensive for iterative use during model development. We introduce SPECS (Specificity-Enhanced CLIPScore), a reference-free RS metric tailored to long image captioning. SPECS modifies CLIP with a new objective that emphasizes specificity: rewarding correct details and penalizing incorrect ones. We show that SPECS matches the performance of open-source LLM-based metrics in correlation with human judgments, while being far more efficient. This makes it a practical alternative for iterative checkpoint evaluation during image captioning model development. Our code can be found at https://github.com/mbzuai-nlp/SPECS.
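
For context, the reference-free CLIPScore that SPECS builds on is commonly defined as a rescaled, clipped cosine similarity between an image embedding and a caption embedding, with a rescaling weight of 2.5. The sketch below illustrates only that scoring step and is not the SPECS implementation. The toy vectors are placeholders for embeddings that would normally come from CLIP's image and text encoders, and the function names `cosine` and `clip_score` are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)


def clip_score(image_emb, caption_emb, w=2.5):
    """Reference-free CLIPScore: w * max(cos(image, caption), 0)."""
    return w * max(cosine(image_emb, caption_emb), 0.0)


# Placeholder embeddings (assumption: real ones come from a CLIP model).
image_emb = [1.0, 0.0]
aligned_caption_emb = [1.0, 0.0]
opposed_caption_emb = [-1.0, 0.0]

print(clip_score(image_emb, aligned_caption_emb))
print(clip_score(image_emb, opposed_caption_emb))
```

No reference captions appear anywhere in the computation, which is what "reference-free" means here: the score depends only on the image and the candidate caption.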

Problem

Research questions and friction points this paper is trying to address.

Evaluating long image captions when standard metrics are unreliable

Addressing low correlation of similarity metrics to human judgments

Reducing high computational costs of LLM-based evaluation methods

Innovation

Methods, ideas, or system contributions that make the work stand out.

Specificity-enhanced CLIPScore for evaluation

Reference-free representational similarity metric

Rewards correct details and penalizes incorrect ones
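
The summary does not spell out the SPECS objective, so the following is only one hypothetical way to express "reward correct details, penalize incorrect ones" as a training signal: a pairwise margin loss over image-caption similarity scores. The function name `specificity_loss`, the three-score setup, and the `margin` value are assumptions made for illustration and should not be read as the authors' formulation.

```python
def specificity_loss(s_base, s_correct, s_incorrect, margin=0.1):
    """Hypothetical margin loss over image-caption similarity scores.

    s_base:      score of a base caption for the image
    s_correct:   score of the base caption extended with a correct detail
    s_incorrect: score of the base caption extended with an incorrect detail
    """
    # Reward: the caption with an added correct detail should score
    # at least `margin` above the base caption.
    reward_term = max(0.0, margin - (s_correct - s_base))
    # Penalty: the caption with an added incorrect detail should score
    # at least `margin` below the base caption.
    penalty_term = max(0.0, margin - (s_base - s_incorrect))
    return reward_term + penalty_term


# Desired ordering already holds with room to spare, so the loss is zero.
print(specificity_loss(s_base=0.5, s_correct=0.7, s_incorrect=0.3))
# The model cannot tell the three captions apart, so both terms are active.
print(specificity_loss(s_base=0.5, s_correct=0.5, s_incorrect=0.5))
```

Under this assumed formulation, the loss vanishes only when added correct detail raises the score and added incorrect detail lowers it, which is the ordering the bullet points above describe.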

🔎 Similar Papers

Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy, Trends and Metrics Analysis