LLM-Free Image Captioning Evaluation in Reference-Flexible Settings

📅 2025-12-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM-based image caption evaluation metrics suffer from generation bias and insufficient neutrality, while traditional LLM-free metrics exhibit unstable performance and struggle to unify reference-based and reference-free evaluation. This paper introduces Pearl, the first supervised, LLM-free image caption evaluation method supporting flexible reference configurations (i.e., both reference-based and reference-free settings). Its core innovation lies in decoupling semantic alignment from generation preference by jointly modeling image–caption and caption–caption similarity, integrating contrastive learning with multimodal embedding alignment, and training on a large-scale dataset of 333K human-annotated samples. Pearl achieves state-of-the-art performance across five established benchmarks, including Composite and Flickr8K-Expert, outperforming all existing LLM-free methods under both reference-based and reference-free evaluation protocols.

📝 Abstract
We focus on the automatic evaluation of image captions in both reference-based and reference-free settings. Existing metrics based on large language models (LLMs) favor their own generations; therefore, their neutrality is in question. Most LLM-free metrics do not suffer from such an issue, whereas they do not always demonstrate high performance. To address these issues, we propose Pearl, an LLM-free supervised metric for image captioning, which is applicable to both reference-based and reference-free settings. We introduce a novel mechanism that learns the representations of image–caption and caption–caption similarities. Furthermore, we construct a human-annotated dataset for image captioning metrics that comprises approximately 333k human judgments collected from 2,360 annotators across over 75k images. Pearl outperformed other existing LLM-free metrics on the Composite, Flickr8K-Expert, Flickr8K-CF, Nebula, and FOIL datasets in both reference-based and reference-free settings. Our project page is available at https://pearl.kinsta.page/.
Problem

Research questions and friction points this paper is trying to address.

Develops an LLM-free metric for image caption evaluation
Addresses bias in LLM-based metrics favoring their own outputs
Enhances performance in both reference-based and reference-free settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes LLM-free supervised metric Pearl
Learns image-caption and caption-caption similarities
Uses human-annotated dataset with 333k judgments
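The reference-flexible idea described above can be sketched as follows. This is a hypothetical illustration, not the paper's actual model: the function name `pearl_like_score`, the blending weight `alpha`, and the raw embedding inputs are all assumptions; the real Pearl is a learned, supervised metric rather than a fixed cosine blend.

```python
# Hypothetical sketch: a reference-flexible caption score that combines
# image-caption similarity with (optional) caption-caption similarity.
# Embeddings would come from a multimodal encoder in practice.
import numpy as np


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def pearl_like_score(img_emb, cand_emb, ref_embs=None, alpha=0.5):
    """Reference-free: image-caption similarity alone.
    Reference-based: blend in the best caption-caption similarity."""
    s_img = cosine(img_emb, cand_emb)
    if not ref_embs:  # no references supplied -> reference-free setting
        return s_img
    s_ref = max(cosine(cand_emb, r) for r in ref_embs)
    return alpha * s_img + (1 - alpha) * s_ref
```

For example, a candidate embedding identical to the image embedding scores 1.0 in the reference-free setting; supplying an orthogonal reference pulls the blended score down. The paper instead learns these similarity representations jointly from human judgments, which is what a fixed cosine blend like this cannot capture.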
Authors

Shinnosuke Hirano (Keio University)
Yuiga Wada (Ph.D. Student, Keio University; Machine Learning)
Kazuki Matsuda (Keio University)
Seitaro Otsuki (Keio University)
Komei Sugiura (Professor, Keio University; Multimodal AI, Robot Learning, Embodied AI, Machine Learning)