BEiTScore: Reference-free Image Captioning Evaluation with an Efficient Cross-Encoder Model

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing image captioning evaluation metrics suffer either from high computational cost or from limitations inherent in CLIP-style models, such as the bag-of-words assumption, input length constraints, and insufficient sensitivity to fine-grained discrepancies. To address these issues, this work proposes a lightweight cross-encoder evaluation metric initialized from a visual question answering model and trained on a supervised data mixture that includes fine-grained negative examples produced by adversarial LLM-based augmentation. This training improves the model's ability to detect vision-language inconsistencies. The resulting metric achieves state-of-the-art performance across multiple benchmarks while keeping computational overhead low, making it well suited to large-scale evaluation, quality-aware decoding, and reward modeling in reinforcement learning.
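One way to picture the quality-aware decoding use case is best-of-N reranking: generate several candidate captions and keep the one the metric scores highest. The sketch below is a minimal illustration under that assumption and is not the paper's procedure. `rerank_captions` and `toy_score` are hypothetical names, and `toy_score` is only a word-overlap stand-in for a learned image-caption scorer, with the image represented as a set of tags.

```python
from typing import Callable, Sequence, Set, Tuple


def rerank_captions(
    image: object,
    candidates: Sequence[str],
    score_fn: Callable[[object, str], float],
) -> Tuple[str, float]:
    """Return the candidate with the highest reference-free score, and that score."""
    if not candidates:
        raise ValueError("need at least one candidate caption")
    scores = [score_fn(image, caption) for caption in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


def toy_score(image_tags: Set[str], caption: str) -> float:
    """Hypothetical stand-in scorer: fraction of caption words that match image tags.

    A real metric would take pixels and text; this only keeps the example runnable.
    """
    words = set(caption.lower().split())
    return len(words & image_tags) / max(len(words), 1)


# The "image" is assumed to be a set of tags purely for illustration.
image_tags = {"dog", "frisbee", "grass"}
candidates = [
    "a dog on grass",
    "a cat on a sofa",
    "dog catching frisbee on grass",
]
best_caption, best_score = rerank_captions(image_tags, candidates, toy_score)
print(best_caption, round(best_score, 2))
```

Because the scorer is called once per candidate, its cost multiplies with N, which is why a lightweight metric matters for this use.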

📝 Abstract

Image captioning evaluation remains a significant challenge as vision-language models evolve toward more demanding capabilities, such as generating long-form and context-rich descriptions. State-of-the-art evaluation metrics either incur the extensive computational cost of using Large Language Models (LLMs) as judges, or suffer from the limitations of standard CLIP-based encoders: strict token limits, a lack of fine-grained sensitivity, and a lack of compositional generalization because captions are treated as "bags of words." We propose a new learned metric that tackles these challenges, based on a lightweight cross-encoder initialized from a visual question-answering model checkpoint, which balances a strong weight initialization with computational efficiency. Our training scheme uses a carefully assembled data mixture for supervised learning, featuring adversarial LLM-based data augmentations that enhance model sensitivity to fine-grained visual-linguistic errors. We also introduce a new benchmark designed to assess detailed captioning evaluation across diverse scenarios. Experimental results demonstrate that the proposed metric achieves state-of-the-art performance while maintaining the efficiency required for large-scale benchmarking, quality-aware decoding, or reward guidance.
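The "bags of words" limitation can be illustrated with a small text-only toy. This is not CLIP and not the proposed metric: `bow_cosine` and `bigram_overlap` are hypothetical helpers written for this example. An order-insensitive similarity cannot tell two captions apart when they use the same words with the roles swapped, while even a crude order-aware comparison can.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between word-count vectors; word order is ignored."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def bigram_overlap(a: str, b: str) -> float:
    """Jaccard overlap of adjacent word pairs; sensitive to word order."""

    def bigrams(s: str) -> set:
        tokens = s.lower().split()
        return set(zip(tokens, tokens[1:]))

    ba, bb = bigrams(a), bigrams(b)
    union = ba | bb
    return len(ba & bb) / len(union) if union else 0.0


correct = "the dog chases the cat"
swapped = "the cat chases the dog"  # same words, opposite meaning

bow = bow_cosine(correct, swapped)
ordered = bigram_overlap(correct, swapped)
print(f"bag-of-words cosine: {bow:.2f}, bigram overlap: {ordered:.2f}")
```

The bag-of-words score treats the two captions as identical, whereas the bigram score drops below 1. A cross-encoder goes further by attending jointly over image and text tokens, although the toy above only demonstrates the text-side failure mode.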

Problem

Research questions and friction points this paper is trying to address.

image captioning evaluation

reference-free evaluation

vision-language models

evaluation metrics

computational efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-encoder

reference-free evaluation

vision-language alignment