🤖 AI Summary
Existing explainable evaluation metrics for image captioning generate explanations without standardized criteria, and the quality of those explanations has gone unverified. EXPERT addresses this gap: it is a reference-free evaluation metric that pairs scores with structured explanations along three fundamental criteria: fluency, relevance, and descriptiveness. The authors construct large-scale datasets of high-quality structured explanations and design a two-stage evaluation template that supervises a vision-language model for both scoring and explanation generation. EXPERT achieves state-of-the-art results on benchmark datasets, and comprehensive human evaluation confirms that its explanations are of significantly higher quality than those produced by existing metrics.
📝 Abstract
Recent advances in large language models and vision-language models have led to growing interest in explainable evaluation metrics for image captioning. However, these metrics generate explanations without standardized criteria, and the overall quality of the generated explanations remains unverified. In this paper, we propose EXPERT, a reference-free evaluation metric that provides structured explanations based on three fundamental criteria: fluency, relevance, and descriptiveness. By constructing large-scale datasets of high-quality structured explanations, we develop a two-stage evaluation template to effectively supervise a vision-language model for both scoring and explanation generation. EXPERT achieves state-of-the-art results on benchmark datasets while providing significantly higher-quality explanations than existing metrics, as validated through comprehensive human evaluation. Our code and datasets are available at https://github.com/hjkim811/EXPERT.
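To make the structured output described above concrete, here is a minimal sketch of what a per-criterion evaluation record might look like. All class names, field names, scores, and explanation strings below are hypothetical illustrations, not part of the EXPERT codebase; the unweighted mean is likewise only one possible way to aggregate the three criteria.

```python
from dataclasses import dataclass


@dataclass
class DimensionResult:
    score: float       # criterion score, assumed normalized to [0, 1]
    explanation: str   # free-text rationale for the score


@dataclass
class CaptionEvaluation:
    """Hypothetical container for a structured, reference-free caption evaluation."""
    fluency: DimensionResult
    relevance: DimensionResult
    descriptiveness: DimensionResult

    def overall(self) -> float:
        # Illustrative aggregation: unweighted mean of the three criterion scores.
        return (self.fluency.score
                + self.relevance.score
                + self.descriptiveness.score) / 3


# Example usage with made-up scores and explanations:
result = CaptionEvaluation(
    fluency=DimensionResult(0.9, "Grammatical and natural phrasing."),
    relevance=DimensionResult(0.7, "Mentions the dog but omits the frisbee."),
    descriptiveness=DimensionResult(0.6, "Names the main subject; gives little scene detail."),
)
print(round(result.overall(), 2))  # 0.73
```

The point of the structure is that each score carries its own dimension-specific explanation, rather than a single opaque number for the whole caption.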