🤖 AI Summary
Existing explainable evaluation metrics for image captioning generate explanations without standardized criteria, and the quality of those explanations has gone unverified. EXPERT addresses this gap: it is a reference-free evaluation metric that pairs scores with structured explanations along three fundamental criteria: fluency, relevance, and descriptiveness. The authors construct large-scale datasets of high-quality structured explanations and design a two-stage evaluation template that supervises a vision-language model for both scoring and explanation generation. EXPERT achieves state-of-the-art results on benchmark datasets, and comprehensive human evaluation confirms that its explanations are of significantly higher quality than those produced by existing metrics.
📝 Abstract
Recent advances in large language models and vision-language models have led to growing interest in explainable evaluation metrics for image captioning. However, these metrics generate explanations without standardized criteria, and the overall quality of the generated explanations remains unverified. In this paper, we propose EXPERT, a reference-free evaluation metric that provides structured explanations based on three fundamental criteria: fluency, relevance, and descriptiveness. By constructing large-scale datasets of high-quality structured explanations, we develop a two-stage evaluation template to effectively supervise a vision-language model for both scoring and explanation generation. EXPERT achieves state-of-the-art results on benchmark datasets while providing significantly higher-quality explanations than existing metrics, as validated through comprehensive human evaluation. Our code and datasets are available at https://github.com/hjkim811/EXPERT.
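To make the structured output described above concrete, here is a minimal sketch of what a per-criterion evaluation record might look like. All class names, field names, scores, and explanation strings below are hypothetical illustrations, not part of the EXPERT codebase; the unweighted mean is likewise only one possible way to aggregate the three criteria.

```python
from dataclasses import dataclass


@dataclass
class DimensionResult:
    score: float       # criterion score, assumed normalized to [0, 1]
    explanation: str   # free-text rationale for the score


@dataclass
class CaptionEvaluation:
    """Hypothetical container for a structured, reference-free caption evaluation."""
    fluency: DimensionResult
    relevance: DimensionResult
    descriptiveness: DimensionResult

    def overall(self) -> float:
        # Illustrative aggregation: unweighted mean of the three criterion scores.
        return (self.fluency.score
                + self.relevance.score
                + self.descriptiveness.score) / 3


# Example usage with made-up scores and explanations:
result = CaptionEvaluation(
    fluency=DimensionResult(0.9, "Grammatical and natural phrasing."),
    relevance=DimensionResult(0.7, "Mentions the dog but omits the frisbee."),
    descriptiveness=DimensionResult(0.6, "Names the main subject; gives little scene detail."),
)
print(round(result.overall(), 2))  # 0.73
```

The point of the structure is that each score carries its own dimension-specific explanation, rather than a single opaque number for the whole caption.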