🤖 AI Summary
To address the poor robustness and weak human-alignment of automatic image caption evaluation for Large Vision-Language Models (LVLMs) under domain shift, this paper proposes DISCODE, a distribution-aware scoring decoder that requires no fine-tuning. Its core is a test-time adaptation (TTA) mechanism: it models the output score distribution with a Gaussian prior and dynamically calibrates scores during inference by analytically minimizing a novel TTA loss. The paper also introduces MCEval, a robustness benchmark covering six diverse domains. Experiments demonstrate that DISCODE achieves state-of-the-art performance as a reference-free evaluation metric across MCEval and four mainstream benchmarks, significantly improving cross-domain consistency and correlation with human judgments while enabling zero-shot generalization to unseen domains.
📝 Abstract
Large vision-language models (LVLMs) have shown impressive performance across a broad range of multimodal tasks. However, robust image caption evaluation using LVLMs remains challenging, particularly under domain-shift scenarios. To address this issue, we introduce the Distribution-Aware Score Decoder (DISCODE), a novel finetuning-free method that generates robust evaluation scores better aligned with human judgments across diverse domains. The core idea behind DISCODE lies in its test-time adaptive evaluation approach, which introduces the Adaptive Test-Time (ATT) loss, leveraging a Gaussian prior distribution to improve robustness in evaluation score estimation. This loss is efficiently minimized at test time using an analytical solution that we derive. Furthermore, we introduce the Multi-domain Caption Evaluation (MCEval) benchmark, a new image captioning evaluation benchmark covering six distinct domains, designed to assess the robustness of evaluation metrics. In our experiments, we demonstrate that DISCODE achieves state-of-the-art performance as a reference-free evaluation metric across MCEval and four representative existing benchmarks.
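The analytical test-time calibration can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the ATT loss combines a quadratic fidelity term on the decoder's predicted score distribution with a Gaussian log-prior, in which case the minimizer is the standard precision-weighted average; all function and parameter names here are hypothetical.

```python
import numpy as np

def calibrate_score(score_probs, score_values, prior_mean=3.0, prior_var=1.0):
    """Hypothetical sketch of distribution-aware score calibration.

    Combines the LVLM decoder's predicted score distribution with a
    Gaussian prior via the closed-form minimizer of a quadratic
    test-time loss (a precision-weighted average). Names and the exact
    loss form are assumptions, not taken from the paper.
    """
    score_probs = np.asarray(score_probs, dtype=float)
    score_values = np.asarray(score_values, dtype=float)
    # Mean and variance of the decoder's predicted score distribution.
    pred_mean = float(score_probs @ score_values)
    pred_var = float(score_probs @ (score_values - pred_mean) ** 2)
    pred_var = max(pred_var, 1e-8)  # guard against a degenerate (one-hot) distribution
    # Analytical minimizer of the assumed quadratic TTA loss:
    #   L(s) = (s - pred_mean)^2 / (2 * pred_var)
    #        + (s - prior_mean)^2 / (2 * prior_var)
    return (prior_var * pred_mean + pred_var * prior_mean) / (prior_var + pred_var)
```

Under this assumed loss, a confident (low-variance) prediction dominates the calibrated score, while a diffuse prediction is pulled toward the prior mean, which is one plausible reading of how a Gaussian prior could improve robustness under domain shift.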