🤖 AI Summary
Existing image caption evaluation metrics struggle to capture visual-semantic alignment and linguistic pragmatic plausibility simultaneously. To address this, we propose a training-free, holistic evaluation framework that, for the first time, integrates three complementary signals: (1) global image-text distribution alignment via Mutual Information Divergence (MID); (2) perceptual similarity of cycle-generated images guided by DINOv2; and (3) contextual text similarity measured by BERTScore. Our framework jointly models image semantic fidelity and linguistic interpretability. On Flickr8k, it achieves a Kendall-τ of 56.43, significantly outperforming 12 state-of-the-art baselines. Moreover, it demonstrates strong cross-domain generalization on Conceptual Captions and MS COCO, where its evaluation scores exhibit higher correlation with human judgments.
📝 Abstract
Evaluating image captions requires cohesive assessment of both visual semantics and language pragmatics, a combination most existing metrics fail to capture fully. We introduce Redemption Score, a novel hybrid framework that ranks image captions by triangulating three complementary signals: (1) Mutual Information Divergence (MID) for global image-text distributional alignment, (2) DINO-based perceptual similarity of cycle-generated images for visual grounding, and (3) BERTScore for contextual text similarity against human references. A calibrated fusion of these signals allows Redemption Score to offer a more holistic assessment. On the Flickr8k benchmark, Redemption Score achieves a Kendall-$\tau$ of 56.43, outperforming twelve prior methods and demonstrating superior correlation with human judgments without requiring task-specific training. Our framework provides a more robust and nuanced evaluation by effectively redeeming image semantics and linguistic interpretability, as indicated by strong knowledge transfer to the Conceptual Captions and MS COCO datasets.
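The abstract describes a "calibrated fusion" of three signals into a single caption score. A minimal sketch of such a fusion is shown below; the function name, weights, and input scores are illustrative assumptions, not the paper's actual calibration, and each signal is presumed to be pre-normalized to a comparable range.

```python
# Hypothetical sketch of fusing three caption-evaluation signals, as the
# abstract describes: MID alignment, DINO perceptual similarity of a
# cycle-generated image, and BERTScore against human references.
# Weights here are illustrative, not the paper's calibrated values.

def redemption_score(mid, dino_sim, bert_f1, weights=(0.4, 0.3, 0.3)):
    """Weighted fusion of three normalized signals into one caption score."""
    w_mid, w_dino, w_bert = weights
    return w_mid * mid + w_dino * dino_sim + w_bert * bert_f1

# Rank candidate captions for one image by fused score (higher is better).
# The per-caption signal triples below are made-up example values.
candidates = {
    "a dog runs along the beach": (0.82, 0.75, 0.90),
    "a cat sits on a couch":      (0.31, 0.28, 0.40),
}
ranked = sorted(
    candidates,
    key=lambda cap: redemption_score(*candidates[cap]),
    reverse=True,
)
```

With these example numbers the well-grounded caption ranks first; in practice the weights would be tuned so that the fused score's ranking correlates with human judgments (e.g. via Kendall-τ on Flickr8k).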