🤖 AI Summary
Current image captioning evaluation suffers from three critical limitations: lack of standardization, insufficient attention to social bias, and neglect of user preferences. To address these issues, we introduce LOTUS, a multidimensional automatic evaluation benchmark that jointly assesses caption quality, quantifies social bias, and aligns with user preferences. LOTUS leverages large vision-language models for fine-grained analysis, employs a scalable scoring mechanism, and incorporates preference-sensitive evaluation to uncover intrinsic trade-offs between descriptive detail and bias risk. Experimental results show that state-of-the-art captioning models exhibit significant performance imbalances across dimensions, with no single model dominating all aspects. Moreover, the optimal model varies substantially with user preference profiles, underscoring both the necessity and the practicality of personalized evaluation. As the first standardized benchmark to integrate fairness, reliability, and personalization, LOTUS establishes a new foundation for equitable and user-aware image captioning assessment.
📝 Abstract
Large Vision-Language Models (LVLMs) have transformed image captioning, shifting from concise captions to detailed descriptions. We introduce LOTUS, a leaderboard for evaluating detailed captions, addressing three main gaps in existing evaluations: lack of standardized criteria, bias-aware assessment, and consideration of user preferences. LOTUS comprehensively evaluates various aspects, including caption quality (e.g., alignment, descriptiveness), risks (e.g., hallucination), and societal biases (e.g., gender bias), while enabling preference-oriented evaluations by tailoring criteria to diverse user preferences. Our analysis of recent LVLMs reveals that no single model excels across all criteria, while correlations emerge between caption detail and bias risks. Preference-oriented evaluations demonstrate that optimal model selection depends on user priorities.