Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives

📅 2025-03-18
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Evaluating long, fine-grained image descriptions generated by multimodal large language models (MLLMs) remains challenging, as conventional metrics (e.g., BLEU, CIDEr) exhibit significant degradation in human correlation, ranking accuracy, and sensitivity to hallucinations. Method: We conduct the first comprehensive empirical assessment of mainstream metrics’ adaptability to MLLM-specific phenomena—output style drift and semantic hallucination—using a multidimensional framework: human evaluation benchmarks, statistical correlation analysis, adversarial perturbation testing, and cross-model output comparison. Contribution/Results: We identify a substantial drop in metric–human judgment correlation; propose a novel, robust three-dimensional evaluation standard—faithfulness, richness, and discriminability—to better capture MLLM output quality; and outline a principled evolution path for evaluation paradigms in the MLLM era. Our findings expose fundamental limitations of existing metrics and provide actionable guidance for developing more reliable, human-aligned assessment methodologies.
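The statistical correlation analysis mentioned above can be sketched minimally: given per-caption scores from an automatic metric and from human raters, Kendall's tau measures how often the metric ranks caption pairs the same way humans do. This is an illustrative stand-in, not the paper's actual code; the scores below are hypothetical.

```python
# Illustrative sketch of metric-vs-human rank correlation (Kendall's tau-a).
# All names and score values are hypothetical, not taken from the paper.
from itertools import combinations

def kendall_tau(metric_scores, human_scores):
    """Kendall's tau-a rank correlation between two paired score lists."""
    concordant = discordant = 0
    for (m1, h1), (m2, h2) in combinations(zip(metric_scores, human_scores), 2):
        s = (m1 - m2) * (h1 - h2)
        if s > 0:
            concordant += 1   # pair ranked the same way by metric and humans
        elif s < 0:
            discordant += 1   # pair ranked in opposite ways
    n = len(metric_scores)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical scores: the metric swaps the humans' top two captions.
metric = [0.91, 0.88, 0.40, 0.15]
human  = [0.85, 0.95, 0.35, 0.10]
print(round(kendall_tau(metric, human), 3))  # 0.667
```

A tau near 1.0 means the metric preserves human rankings; the drop the paper reports for MLLM outputs corresponds to this value falling well below that on long, detailed captions.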

📝 Abstract
The evaluation of machine-generated image captions is a complex and evolving challenge. With the advent of Multimodal Large Language Models (MLLMs), image captioning has become a core task, increasing the need for robust and reliable evaluation metrics. This survey provides a comprehensive overview of advancements in image captioning evaluation, analyzing the evolution, strengths, and limitations of existing metrics. We assess these metrics across multiple dimensions, including correlation with human judgment, ranking accuracy, and sensitivity to hallucinations. Additionally, we explore the challenges posed by the longer and more detailed captions generated by MLLMs and examine the adaptability of current metrics to these stylistic variations. Our analysis highlights some limitations of standard evaluation approaches and suggests promising directions for future research in image captioning assessment.
Problem

Research questions and friction points this paper is trying to address.

Evaluating image captions generated by Multimodal LLMs.
Assessing the robustness and reliability of current evaluation metrics.
Exploring how well existing metrics adapt to the longer, more detailed captions MLLMs produce.
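The last friction point can be made concrete with a minimal sketch of clipped unigram precision, the building block of BLEU-1. The captions and helper name here are illustrative assumptions, not from the paper; the point is that a long, accurate MLLM caption is penalized simply for adding detail absent from a short reference.

```python
# Minimal sketch of BLEU-style clipped unigram precision against one
# reference. Captions and the helper name are illustrative assumptions.
from collections import Counter

def clipped_unigram_precision(candidate, reference):
    """Fraction of candidate tokens matched in the reference, with counts clipped."""
    cand = candidate.lower().split()
    ref_counts = Counter(reference.lower().split())
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    return clipped / len(cand)

reference = "a dog runs on the beach"
short_cap = "a dog runs on the beach"
long_cap  = ("a small brown dog runs energetically along the wet sandy beach "
             "while waves crash in the background under a cloudy sky")

print(clipped_unigram_precision(short_cap, reference))            # 1.0
print(round(clipped_unigram_precision(long_cap, reference), 2))   # 0.24
```

The longer caption is arguably more faithful to the image, yet its score collapses because most of its extra tokens have no counterpart in the reference, which is the style-drift failure mode the paper examines.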
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes the evolution of image captioning evaluation metrics.
Assesses metrics across multiple dimensions (human correlation, ranking accuracy, hallucination sensitivity).
Examines the challenges that MLLM-generated captions pose for existing metrics.