🤖 AI Summary
Current T2I model evaluation methods suffer from poor scenario adaptability and low external validity: ELO-based overall rankings and unidimensional MOS scores both exhibit weak interpretability and dimensional imbalance. To address these issues, we propose a user-scenario-oriented structured evaluation framework: (1) we construct a comprehensive taxonomy covering both capability dimensions and realistic application scenarios; (2) we design Magic-Bench-377, a benchmark dataset grounded in authentic usage contexts; (3) we integrate ELO ranking with multidimensional MOS scoring and employ multivariate logistic regression to quantify each dimension's contribution to user satisfaction. Our contributions include fine-grained leaderboards and capability profiles for mainstream T2I models, alongside an open-source evaluation framework and benchmark dataset, collectively enhancing interpretability, external validity, and practical guidance for model selection and development.
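The ELO component of the pipeline can be illustrated with the standard pairwise update rule. This is a minimal sketch: the K-factor, the initial ratings, and the scale constant 400 are conventional ELO defaults assumed here, not values stated by the paper.

```python
def elo_update(rating_a, rating_b, outcome, k=32.0):
    """Standard ELO update after one pairwise comparison.

    outcome: 1.0 if model A's image is preferred, 0.0 if model B's,
    0.5 for a tie. k (assumed value) controls the update magnitude.
    """
    # Expected win probability of A under the logistic ELO model.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (outcome - expected_a)
    new_b = rating_b + k * ((1.0 - outcome) - (1.0 - expected_a))
    return new_a, new_b

# Two equally rated models; the annotator prefers A's image.
print(elo_update(1500.0, 1500.0, 1.0))  # -> (1516.0, 1484.0)
```

Note that the update is zero-sum: whatever rating A gains, B loses, which is what makes repeated pairwise comparisons converge to a stable overall ranking.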
📝 Abstract
Rapid advances in text-to-image (T2I) generation have placed greater demands on evaluation methodologies. Existing benchmarks center on objective capability dimensions but lack an application-scenario perspective, limiting their external validity. Moreover, current evaluations typically rely on either ELO for overall ranking or MOS for dimension-specific scoring, yet both methods have inherent shortcomings and limited interpretability. We therefore introduce the Magic Evaluation Framework (MEF), a systematic and practical approach to evaluating T2I models. First, we propose a structured taxonomy encompassing user scenarios, elements, element compositions, and text expression forms, from which we construct Magic-Bench-377, a benchmark that supports label-level assessment and ensures balanced coverage of both user scenarios and capabilities. On this basis, we combine ELO and dimension-specific MOS to generate model rankings and fine-grained assessments, respectively. This joint evaluation further enables us to quantitatively analyze the contribution of each dimension to user satisfaction via multivariate logistic regression. Applying MEF to current T2I models, we obtain a leaderboard and identify the key characteristics of the leading models. We release our evaluation framework and make Magic-Bench-377 fully open source to advance research on the evaluation of visual generative models.
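The regression step, quantifying each dimension's contribution to user satisfaction, can be sketched with a plain-Python logistic regression fit by batch gradient descent. The dimension names, the synthetic data, and the hyperparameters below are illustrative assumptions; the paper's actual feature construction and fitting procedure may differ.

```python
import math
import random

def fit_logistic(X, y, lr=0.5, epochs=1000):
    """Fit logistic-regression weights by batch gradient descent.

    X: feature rows (e.g. per-dimension MOS deltas between two models),
    y: 1 if the user preferred model A, else 0. No intercept, for brevity.
    """
    d = len(X[0])
    w = [0.0] * d
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted preference prob.
            for j in range(d):
                grad[j] += (yi - p) * xi[j]
        for j in range(d):
            w[j] += lr * grad[j] / n
    return w

# Synthetic pairwise comparisons over three assumed dimensions:
# [aesthetics, text_rendering, prompt_adherence].
random.seed(0)
X, y = [], []
for _ in range(300):
    x = [random.uniform(-1.0, 1.0) for _ in range(3)]
    # Ground truth: preference driven mostly by prompt adherence.
    y.append(1 if 0.5 * x[0] + 1.5 * x[2] > 0 else 0)
    X.append(x)

weights = fit_logistic(X, y)
# The fitted weights recover the relative importance of each dimension:
# prompt_adherence should receive the largest weight.
```

The magnitude of each fitted weight then serves as the dimension's contribution to the preference outcome, which is the interpretability payoff this joint ELO-plus-MOS design is after.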