LOTUS: A Leaderboard for Detailed Image Captioning from Quality to Societal Bias and User Preferences

📅 2025-07-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current image captioning evaluation suffers from three critical limitations: lack of standardization, insufficient attention to social bias, and neglect of user preferences. To address these issues, we introduce LOTUS—a novel, multidimensional automatic evaluation benchmark that jointly assesses caption quality, quantifies social bias, and aligns with user preferences. LOTUS leverages large vision-language models for fine-grained analysis, employs a scalable scoring mechanism, and incorporates preference-sensitive evaluation to uncover intrinsic trade-offs between descriptive detail and bias risk. Experimental results demonstrate that state-of-the-art captioning models exhibit significant performance imbalances across dimensions, with no single model dominating all aspects. Moreover, the optimal model varies substantially depending on user preference profiles, underscoring the necessity and practicality of personalized evaluation. As the first standardized benchmark integrating fairness, reliability, and personalization, LOTUS establishes a new foundation for equitable and user-aware image captioning assessment.

📝 Abstract
Large Vision-Language Models (LVLMs) have transformed image captioning, shifting from concise captions to detailed descriptions. We introduce LOTUS, a leaderboard for evaluating detailed captions, addressing three main gaps in existing evaluations: lack of standardized criteria, bias-aware assessments, and user preference considerations. LOTUS comprehensively evaluates various aspects, including caption quality (e.g., alignment, descriptiveness), risks (e.g., hallucination), and societal biases (e.g., gender bias) while enabling preference-oriented evaluations by tailoring criteria to diverse user preferences. Our analysis of recent LVLMs reveals no single model excels across all criteria, while correlations emerge between caption detail and bias risks. Preference-oriented evaluations demonstrate that optimal model selection depends on user priorities.
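The preference-oriented evaluation described above can be pictured as weighting per-criterion scores by a user profile and picking the model with the best weighted score. The sketch below is purely illustrative, not the paper's actual scoring code: the criterion names, model scores, and weight profiles are invented for this example.

```python
# Hypothetical sketch of preference-oriented model selection:
# a weighted average of per-criterion scores, re-weighted per user profile.
# All names and numbers below are illustrative assumptions, not LOTUS data.

def preference_score(criterion_scores: dict, weights: dict) -> float:
    """Weighted average of per-criterion scores (higher is better).

    Risk-type criteria (e.g., hallucination) are assumed to be passed
    already inverted, so that higher always means better.
    """
    total_weight = sum(weights.values())
    return sum(criterion_scores[c] * w for c, w in weights.items()) / total_weight

# Illustrative per-model scores on three criteria, each in [0, 1].
models = {
    "model_A": {"alignment": 0.90, "descriptiveness": 0.60, "low_bias": 0.85},
    "model_B": {"alignment": 0.80, "descriptiveness": 0.95, "low_bias": 0.60},
}

# Two hypothetical user profiles: one prioritizes detail, the other fairness.
detail_lover = {"alignment": 1.0, "descriptiveness": 3.0, "low_bias": 1.0}
fairness_first = {"alignment": 1.0, "descriptiveness": 1.0, "low_bias": 3.0}

best_for_detail = max(models, key=lambda m: preference_score(models[m], detail_lover))
best_for_fairness = max(models, key=lambda m: preference_score(models[m], fairness_first))
```

With these made-up numbers, the detail-focused profile selects `model_B` while the fairness-focused profile selects `model_A`, mirroring the paper's finding that the optimal model depends on user priorities.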
Problem

Research questions and friction points this paper is trying to address.

Standardized evaluation criteria for detailed image captions
Assessing societal biases and risks in caption generation
Aligning caption quality with diverse user preferences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Standardized criteria for detailed caption evaluation
Bias-aware assessments in image captioning
User preference-oriented evaluation framework