MSD-Score: Multi-Scale Distributional Scoring for Reference-Free Image Caption Evaluation

📅 2026-05-07

🤖 AI Summary
This work targets a weakness of reference-free image caption evaluation: global similarity scores tend to miss fine-grained semantic mismatches such as hallucinations, missing attributes, or relational errors. It introduces a distributional scoring framework that, in what is presented as a first, models patch-level image embeddings and token-level text embeddings on the unit hypersphere as multi-scale mixtures of von Mises–Fisher distributions. A weighted bidirectional KL divergence is combined with global similarity to produce an overall alignment score. The method supports both single- and multi-candidate caption evaluation and yields interpretable, decomposable diagnostics of local misalignments. Across multiple benchmarks, it achieves state-of-the-art correlation with human judgments among reference-free metrics.
📝 Abstract
Evaluating image captions without references remains challenging because global embedding similarity often misses fine-grained mismatches such as hallucinated objects, missing attributes, or incorrect relations. We propose MSD-Score, a reference-free metric that models image patch and text token embeddings as von Mises-Fisher mixtures on the unit hypersphere. Instead of treating each modality as a single point, MSD-Score formulates image-text matching as a multi-scale distributional scoring problem. Semantic discrepancies are quantified via a weighted bi-directional KL divergence and combined with global similarity in a multi-scale framework for both single- and multi-candidate evaluations. Extensive experiments show that MSD-Score achieves state-of-the-art correlation with human judgments among reference-free metrics. Beyond accuracy, its probabilistic formulation yields transparent and decomposable diagnostics of local grounding errors, providing a deterministic complementary signal to holistic similarity metrics and judge-based evaluators.
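To make the "weighted bi-directional KL divergence" between von Mises-Fisher (vMF) components more concrete, here is a minimal sketch of the closed-form KL divergence between two single vMF distributions. It is an illustration under stated assumptions, not the paper's implementation: it is restricted to the 3-D sphere (where the vMF normalizer has an elementary form), it handles single components rather than mixtures (KL between mixtures has no closed form in general), and the function names and the scalar weight `w` are hypothetical choices made for this example.

```python
import numpy as np

def log_norm_vmf3(kappa):
    """Log normalizer of a vMF density on the 3-D unit sphere:
    C_3(kappa) = kappa / (4 * pi * sinh(kappa))."""
    # Numerically stable log(sinh(kappa)) for kappa > 0.
    log_sinh = kappa + np.log1p(-np.exp(-2.0 * kappa)) - np.log(2.0)
    return np.log(kappa) - np.log(4.0 * np.pi) - log_sinh

def mean_resultant_length3(kappa):
    """A_3(kappa) = coth(kappa) - 1/kappa, so that E[x] = A_3(kappa) * mu."""
    return 1.0 / np.tanh(kappa) - 1.0 / kappa

def kl_vmf3(mu_p, kappa_p, mu_q, kappa_q):
    """Closed-form KL(p || q) for two vMF distributions in 3-D.
    mu_p and mu_q are unit vectors; kappa_p and kappa_q are positive."""
    cos_pq = float(np.dot(mu_p, mu_q))
    return (log_norm_vmf3(kappa_p) - log_norm_vmf3(kappa_q)
            + mean_resultant_length3(kappa_p) * (kappa_p - kappa_q * cos_pq))

def weighted_bidirectional_kl(mu_p, kappa_p, mu_q, kappa_q, w=0.5):
    """Hypothetical weighting: w * KL(p || q) + (1 - w) * KL(q || p).
    The weighting scheme actually used by MSD-Score is not given here."""
    forward = kl_vmf3(mu_p, kappa_p, mu_q, kappa_q)
    backward = kl_vmf3(mu_q, kappa_q, mu_p, kappa_p)
    return w * forward + (1.0 - w) * backward

# Toy usage: an "image" component and a "text" component pointing
# in orthogonal directions, with different concentrations.
image_mu = np.array([0.0, 0.0, 1.0])
text_mu = np.array([1.0, 0.0, 0.0])
divergence = weighted_bidirectional_kl(image_mu, 5.0, text_mu, 2.0, w=0.5)
```

In this toy setting the divergence is zero only when both components share the same mean direction and concentration, and it grows as the directions separate, which is the kind of local mismatch signal the abstract describes. Real embeddings are high-dimensional, where the normalizer involves a modified Bessel function rather than `sinh`.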
Problem

Research questions and friction points this paper is trying to address.

reference-free evaluation
image captioning
fine-grained mismatch
semantic discrepancy
distributional scoring
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-scale distributional scoring
reference-free evaluation
von Mises-Fisher mixture
image-text grounding
KL divergence