MEF: A Systematic Evaluation Framework for Text-to-Image Models

📅 2025-09-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current T2I model evaluation methods suffer from poor scenario adaptability and low external validity; both ELO-based overall rankings and unidimensional MOS scores exhibit weak interpretability and dimensional imbalance. To address these issues, we propose a user-scenario-oriented structured evaluation framework: (1) we construct a comprehensive taxonomy covering both capability dimensions and realistic application scenarios; (2) we design Magic-Bench-377, a benchmark dataset grounded in authentic usage contexts; (3) we integrate ELO ranking with multidimensional MOS scoring and employ multivariate logistic regression to quantify the contribution weight of each dimension to user satisfaction. Our contributions include fine-grained leaderboards and capability profiles for mainstream T2I models, alongside an open-source evaluation framework and benchmark dataset—collectively enhancing interpretability, external validity, and practical guidance for model selection and development.
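The ELO component of the framework ranks models from pairwise human preference votes. The paper does not publish its update rule, but a standard ELO update (the K-factor and 400-point scale below are conventional defaults, not values from the paper) looks like this:

```python
# Minimal sketch of an ELO rating update from one pairwise preference vote,
# as commonly used for overall T2I model ranking. k=32 and the 400-point
# logistic scale are conventional assumptions, not values from the paper.
def elo_update(r_a, r_b, outcome, k=32):
    """outcome: 1.0 if model A is preferred, 0.0 if B is, 0.5 for a tie."""
    # Expected score of A under the logistic rating model.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    r_a_new = r_a + k * (outcome - expected_a)
    r_b_new = r_b + k * ((1 - outcome) - (1 - expected_a))
    return r_a_new, r_b_new
```

Note that the update is zero-sum: whatever rating A gains, B loses, so the pool's mean rating stays fixed as votes accumulate.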

📝 Abstract
Rapid advances in text-to-image (T2I) generation have raised higher requirements for evaluation methodologies. Existing benchmarks center on objective capabilities and dimensions, but lack an application-scenario perspective, limiting external validity. Moreover, current evaluations typically rely on either ELO for overall ranking or MOS for dimension-specific scoring, yet both methods have inherent shortcomings and limited interpretability. Therefore, we introduce the Magic Evaluation Framework (MEF), a systematic and practical approach for evaluating T2I models. First, we propose a structured taxonomy encompassing user scenarios, elements, element compositions, and text expression forms to construct the Magic-Bench-377, which supports label-level assessment and ensures a balanced coverage of both user scenarios and capabilities. On this basis, we combine ELO and dimension-specific MOS to generate model rankings and fine-grained assessments respectively. This joint evaluation method further enables us to quantitatively analyze the contribution of each dimension to user satisfaction using multivariate logistic regression. By applying MEF to current T2I models, we obtain a leaderboard and key characteristics of the leading models. We release our evaluation framework and make Magic-Bench-377 fully open-source to advance research in the evaluation of visual generative models.
Problem

Research questions and friction points this paper is trying to address.

Evaluating text-to-image models lacks an application-scenario perspective, limiting external validity
Current evaluation methods (ELO-only rankings or unidimensional MOS scores) have inherent shortcomings and limited interpretability
No systematic framework exists for fine-grained, user-centered assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured taxonomy for scenario-based benchmark construction
Combined ELO and MOS for multi-dimensional model assessment
Multivariate regression analyzing dimension contributions to satisfaction
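The regression step fits a binary satisfaction outcome against per-dimension MOS scores, so each coefficient quantifies that dimension's contribution. The paper's exact dimensions and data are not given here, so the sketch below uses illustrative dimension scores and a plain SGD fit:

```python
import math

# Hedged sketch of multivariate logistic regression: predict a binary
# satisfaction label from per-dimension MOS scores. The two dimensions,
# the toy data, and the SGD hyperparameters are illustrative assumptions.
def fit_logistic(X, y, lr=0.1, epochs=2000):
    n_dim = len(X[0])
    w = [0.0] * n_dim
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Sigmoid of the linear score gives P(satisfied).
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi  # gradient of the log-loss w.r.t. z
            b -= lr * g
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
    return w, b

# Toy data: columns are (say) prompt-alignment MOS and aesthetics MOS on a
# 1-5 scale; satisfaction here is driven by the first dimension only.
X = [[5, 2], [4, 3], [3, 1], [2, 4], [1, 5], [2, 2], [5, 5], [1, 3]]
y = [1, 1, 1, 0, 0, 0, 1, 0]
w, b = fit_logistic(X, y)
```

On this toy data the learned weight for the first dimension dominates, which is the kind of per-dimension contribution signal the framework reports.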
👥 Authors (ByteDance Seed): Xiaojing Dong, Weilin Huang, Liang Li, Yiying Li, Shu Liu, Tongtong Ou, Shuang Ouyang, Yu Tian, Fengxuan Zhao
Topics: Computer Vision, Deep Learning