🤖 AI Summary
Current T2I model evaluation methods suffer from poor scenario adaptability and low external validity: ELO-based overall rankings and unidimensional MOS scores both exhibit weak interpretability and dimensional imbalance. To address these issues, we propose a user-scenario-oriented structured evaluation framework: (1) we construct a comprehensive taxonomy covering both capability dimensions and realistic application scenarios; (2) we design Magic-Bench-377, a benchmark dataset grounded in authentic usage contexts; (3) we integrate ELO ranking with multidimensional MOS scoring and employ multivariate logistic regression to quantify each dimension's contribution to user satisfaction. Our contributions include fine-grained leaderboards and capability profiles for mainstream T2I models, alongside an open-source evaluation framework and benchmark dataset, collectively enhancing interpretability, external validity, and practical guidance for model selection and development.
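The ELO component of the pipeline can be illustrated with the standard pairwise update rule. This is a minimal sketch: the K-factor, the initial ratings, and the scale constant 400 are conventional ELO defaults assumed here, not values stated by the paper.

```python
def elo_update(rating_a, rating_b, outcome, k=32.0):
    """Standard ELO update after one pairwise comparison.

    outcome: 1.0 if model A's image is preferred, 0.0 if model B's,
    0.5 for a tie. k (assumed value) controls the update magnitude.
    """
    # Expected win probability of A under the logistic ELO model.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (outcome - expected_a)
    new_b = rating_b + k * ((1.0 - outcome) - (1.0 - expected_a))
    return new_a, new_b

# Two equally rated models; the annotator prefers A's image.
print(elo_update(1500.0, 1500.0, 1.0))  # -> (1516.0, 1484.0)
```

Note that the update is zero-sum: whatever rating A gains, B loses, which is what makes repeated pairwise comparisons converge to a stable overall ranking.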
📝 Abstract
Rapid advances in text-to-image (T2I) generation have placed greater demands on evaluation methodologies. Existing benchmarks center on objective capability dimensions but lack an application-scenario perspective, limiting their external validity. Moreover, current evaluations typically rely on either ELO for overall ranking or MOS for dimension-specific scoring, yet both methods have inherent shortcomings and limited interpretability. We therefore introduce the Magic Evaluation Framework (MEF), a systematic and practical approach to evaluating T2I models. First, we propose a structured taxonomy encompassing user scenarios, elements, element compositions, and text expression forms, from which we construct Magic-Bench-377, a benchmark that supports label-level assessment and ensures balanced coverage of both user scenarios and capabilities. On this basis, we combine ELO and dimension-specific MOS to generate model rankings and fine-grained assessments, respectively. This joint evaluation further enables us to quantitatively analyze the contribution of each dimension to user satisfaction via multivariate logistic regression. Applying MEF to current T2I models, we obtain a leaderboard and identify the key characteristics of the leading models. We release our evaluation framework and make Magic-Bench-377 fully open source to advance research on the evaluation of visual generative models.
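The regression step, quantifying each dimension's contribution to user satisfaction, can be sketched with a plain-Python logistic regression fit by batch gradient descent. The dimension names, the synthetic data, and the hyperparameters below are illustrative assumptions; the paper's actual feature construction and fitting procedure may differ.

```python
import math
import random

def fit_logistic(X, y, lr=0.5, epochs=1000):
    """Fit logistic-regression weights by batch gradient descent.

    X: feature rows (e.g. per-dimension MOS deltas between two models),
    y: 1 if the user preferred model A, else 0. No intercept, for brevity.
    """
    d = len(X[0])
    w = [0.0] * d
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted preference prob.
            for j in range(d):
                grad[j] += (yi - p) * xi[j]
        for j in range(d):
            w[j] += lr * grad[j] / n
    return w

# Synthetic pairwise comparisons over three assumed dimensions:
# [aesthetics, text_rendering, prompt_adherence].
random.seed(0)
X, y = [], []
for _ in range(300):
    x = [random.uniform(-1.0, 1.0) for _ in range(3)]
    # Ground truth: preference driven mostly by prompt adherence.
    y.append(1 if 0.5 * x[0] + 1.5 * x[2] > 0 else 0)
    X.append(x)

weights = fit_logistic(X, y)
# The fitted weights recover the relative importance of each dimension:
# prompt_adherence should receive the largest weight.
```

The magnitude of each fitted weight then serves as the dimension's contribution to the preference outcome, which is the interpretability payoff this joint ELO-plus-MOS design is after.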