LLM-Free Image Captioning Evaluation in Reference-Flexible Settings

📅 2025-12-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM-based image caption evaluation metrics suffer from generation bias and insufficient neutrality, while traditional LLM-free metrics exhibit unstable performance and struggle to unify reference-based and reference-free evaluation. This paper introduces Pearl, the first supervised, LLM-free image caption evaluation method supporting flexible reference configurations (i.e., both reference-based and reference-free settings). Its core innovation lies in decoupling semantic alignment from generation preference by jointly modeling image–caption and caption–caption similarity, integrating contrastive learning with multimodal embedding alignment, and training on a large-scale dataset of 333K human-annotated samples. Pearl achieves state-of-the-art performance across five established benchmarks, including Composite and Flickr8K-Expert, outperforming all existing LLM-free methods under both reference-based and reference-free evaluation protocols.

📝 Abstract
We focus on the automatic evaluation of image captions in both reference-based and reference-free settings. Existing metrics based on large language models (LLMs) favor their own generations; therefore, their neutrality is in question. Most LLM-free metrics do not suffer from such an issue, whereas they do not always demonstrate high performance. To address these issues, we propose Pearl, an LLM-free supervised metric for image captioning, which is applicable to both reference-based and reference-free settings. We introduce a novel mechanism that learns the representations of image–caption and caption–caption similarities. Furthermore, we construct a human-annotated dataset for image captioning metrics that comprises approximately 333k human judgments collected from 2,360 annotators across over 75k images. Pearl outperformed other existing LLM-free metrics on the Composite, Flickr8K-Expert, Flickr8K-CF, Nebula, and FOIL datasets in both reference-based and reference-free settings. Our project page is available at https://pearl.kinsta.page/.
Problem

Research questions and friction points this paper is trying to address.

Develops an LLM-free metric for image caption evaluation
Addresses bias in LLM-based metrics favoring their own outputs
Enhances performance in both reference-based and reference-free settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes LLM-free supervised metric Pearl
Learns image-caption and caption-caption similarities
Uses human-annotated dataset with 333k judgments
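The reference-flexible idea described above can be sketched as follows. This is a hypothetical illustration, not the paper's actual model: the function name `pearl_like_score`, the blending weight `alpha`, and the raw embedding inputs are all assumptions; the real Pearl is a learned, supervised metric rather than a fixed cosine blend.

```python
# Hypothetical sketch: a reference-flexible caption score that combines
# image-caption similarity with (optional) caption-caption similarity.
# Embeddings would come from a multimodal encoder in practice.
import numpy as np


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def pearl_like_score(img_emb, cand_emb, ref_embs=None, alpha=0.5):
    """Reference-free: image-caption similarity alone.
    Reference-based: blend in the best caption-caption similarity."""
    s_img = cosine(img_emb, cand_emb)
    if not ref_embs:  # no references supplied -> reference-free setting
        return s_img
    s_ref = max(cosine(cand_emb, r) for r in ref_embs)
    return alpha * s_img + (1 - alpha) * s_ref
```

For example, a candidate embedding identical to the image embedding scores 1.0 in the reference-free setting; supplying an orthogonal reference pulls the blended score down. The paper instead learns these similarity representations jointly from human judgments, which is what a fixed cosine blend like this cannot capture.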
Authors

Shinnosuke Hirano (Keio University)
Yuiga Wada (Ph.D. Student, Keio University; Machine Learning)
Kazuki Matsuda (Keio University)
Seitaro Otsuki (Keio University)
Komei Sugiura (Professor, Keio University; Multimodal AI, Robot Learning, Embodied AI, Machine Learning)