DISCODE: Distribution-Aware Score Decoder for Robust Automatic Evaluation of Image Captioning

📅 2025-12-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited robustness and weak human alignment of automatic image caption evaluation for large vision-language models (LVLMs) under domain shift, this paper proposes DISCODE, a distribution-aware score decoder that requires no fine-tuning. Its core is a test-time adaptation mechanism: it models the output score distribution with a Gaussian prior and calibrates scores during inference by analytically minimizing a novel Adaptive Test-Time (ATT) loss. The authors also introduce MCEval, a robustness benchmark for caption evaluation covering six distinct domains. Experiments show that DISCODE achieves state-of-the-art reference-free evaluation performance on MCEval and four representative existing benchmarks, improving cross-domain consistency and correlation with human judgments while generalizing zero-shot to unseen domains.

📝 Abstract
Large vision-language models (LVLMs) have shown impressive performance across a broad range of multimodal tasks. However, robust image caption evaluation using LVLMs remains challenging, particularly under domain-shift scenarios. To address this issue, we introduce the Distribution-Aware Score Decoder (DISCODE), a novel finetuning-free method that generates robust evaluation scores better aligned with human judgments across diverse domains. The core idea behind DISCODE lies in its test-time adaptive evaluation approach, which introduces the Adaptive Test-Time (ATT) loss, leveraging a Gaussian prior distribution to improve robustness in evaluation score estimation. This loss is efficiently minimized at test time using an analytical solution that we derive. Furthermore, we introduce the Multi-domain Caption Evaluation (MCEval) benchmark, a new image captioning evaluation benchmark covering six distinct domains, designed to assess the robustness of evaluation metrics. In our experiments, we demonstrate that DISCODE achieves state-of-the-art performance as a reference-free evaluation metric across MCEval and four representative existing benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Robust image caption evaluation under domain-shift scenarios
Generating evaluation scores better aligned with human judgments
Improving robustness in evaluation score estimation with adaptive methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distribution-aware score decoder for robust evaluation
Test-time adaptive evaluation with Gaussian prior distribution
Analytical solution for efficient loss minimization
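The abstract describes the core idea but not its equations, so the following is a minimal illustrative sketch of how a test-time score calibration with a Gaussian prior and a closed-form minimizer might look. The loss form, function name, and all parameter values are assumptions for illustration, not the paper's actual ATT loss: it combines the expected squared error under the model's score distribution with a quadratic (Gaussian log-prior) penalty, which admits an analytical solution.

```python
import numpy as np

def calibrated_score(score_probs, scores, prior_mean=5.0, prior_var=4.0, lam=0.5):
    """Illustrative test-time calibration of a discrete score distribution.

    score_probs: model probabilities over candidate score values
    scores: the candidate score values themselves (e.g. 1..10)
    prior_mean, prior_var: parameters of an assumed Gaussian prior on the score
    lam: strength of the prior penalty (hypothetical hyperparameter)
    """
    score_probs = np.asarray(score_probs, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # Assumed loss: L(s) = sum_k p_k (s - k)^2 + lam * (s - mu0)^2 / var0
    # Setting dL/ds = 0 (with sum_k p_k = 1) gives the closed form below.
    expected = np.dot(score_probs, scores)
    return (expected + lam * prior_mean / prior_var) / (1.0 + lam / prior_var)
```

With a symmetric distribution centered on the prior mean the score is unchanged, while an extreme score is shrunk toward the prior, which is the qualitative behavior one would expect from a distribution-aware, prior-regularized decoder.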
Nakamasa Inoue
Institute of Science Tokyo
Kanoko Goto
Institute of Science Tokyo
Masanari Oi
Institute of Science Tokyo
Martyna Gruszka
Institute of Science Tokyo
Mahiro Ukai
Institute of Science Tokyo
Takumi Hirose
Institute of Science Tokyo
Yusuke Sekikawa
DENSO IT Laboratory

computer vision, machine learning