🤖 AI Summary
Existing image caption evaluation metrics suffer either from high computational costs or from limitations inherent in CLIP-based models, such as the bag-of-words assumption, length constraints, and insufficient sensitivity to fine-grained discrepancies. To address these issues, this work proposes a lightweight cross-encoder evaluation metric initialized from a visual question answering model and trained with mixed supervision that incorporates fine-grained negative examples generated by an adversarial large language model. These negatives improve the model’s ability to detect vision-language inconsistencies. The resulting metric achieves state-of-the-art performance across multiple benchmarks while maintaining low computational overhead, making it well suited for large-scale evaluation, quality-aware decoding, and reward modeling in reinforcement learning scenarios.
📝 Abstract
Image captioning evaluation remains a significant challenge as vision-language models evolve toward more demanding capabilities, such as generating long-form, context-rich descriptions. State-of-the-art evaluation metrics either incur the extensive computational costs of using Large Language Models (LLMs) as judges, or suffer from the limitations of standard CLIP-based encoders, such as strict token limits, lack of fine-grained sensitivity, or a lack of compositional generalization caused by treating captions as "bags of words." We propose a new learned metric that tackles these challenges, based on a lightweight cross-encoder initialized from a visual question-answering model checkpoint, which balances a strong weight initialization with computational efficiency. Our training scheme uses a carefully assembled data mixture for supervised learning, featuring adversarial LLM-based data augmentations to enhance model sensitivity to fine-grained visual-linguistic errors. We also introduce a new benchmark designed to assess detailed captioning evaluation across diverse scenarios. Experimental results demonstrate that the proposed metric achieves state-of-the-art performance while maintaining the efficiency required for large-scale benchmarking, quality-aware decoding, or reward guidance.