🤖 AI Summary
Current image captioning evaluation suffers from three critical limitations: lack of standardization, insufficient attention to social bias, and neglect of user preferences. To address these issues, we introduce LOTUS, a multidimensional automatic evaluation benchmark that jointly assesses caption quality, quantifies social bias, and aligns with user preferences. LOTUS leverages large vision-language models for fine-grained analysis, employs a scalable scoring mechanism, and incorporates preference-sensitive evaluation to uncover intrinsic trade-offs between descriptive detail and bias risk. Experimental results show that state-of-the-art captioning models exhibit significant performance imbalances across dimensions, with no single model dominating all aspects. Moreover, the optimal model varies substantially with user preference profiles, underscoring both the necessity and the practicality of personalized evaluation. As the first standardized benchmark to integrate fairness, reliability, and personalization, LOTUS establishes a new foundation for equitable and user-aware image captioning assessment.
📝 Abstract
Large Vision-Language Models (LVLMs) have transformed image captioning, shifting from concise captions to detailed descriptions. We introduce LOTUS, a leaderboard for evaluating detailed captions, addressing three main gaps in existing evaluations: lack of standardized criteria, bias-aware assessment, and consideration of user preferences. LOTUS comprehensively evaluates various aspects, including caption quality (e.g., alignment, descriptiveness), risks (e.g., hallucination), and societal biases (e.g., gender bias), while enabling preference-oriented evaluations by tailoring criteria to diverse user preferences. Our analysis of recent LVLMs reveals that no single model excels across all criteria, while correlations emerge between caption detail and bias risks. Preference-oriented evaluations demonstrate that optimal model selection depends on user priorities.