Q-Eval-100K: Evaluating Visual Quality and Alignment Level for Text-to-Vision Content

📅 2025-03-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current text-to-vision generation evaluation faces two bottlenecks: (1) small-scale, low-quality human annotations, and (2) the absence of a unified benchmark and model that jointly assess visual quality and text-vision alignment. To address these, we introduce Q-Eval-100K, a large-scale, fine-grained human evaluation benchmark comprising 100K image and video samples (60K images and 40K videos) and 960K human annotations used to derive Mean Opinion Scores (MOS), enabling unified evaluation along two dimensions (quality and alignment). Based on this, we propose Q-Eval-Score, a multimodal unified assessment model that supports long-text prompt modeling via context-enhanced prompting, multi-task joint learning, and cross-modal alignment optimization. Experiments indicate that Q-Eval-Score achieves superior performance on Q-Eval-100K for both visual quality and alignment and generalizes well to other benchmarks.

📝 Abstract

Evaluating text-to-vision content hinges on two crucial aspects: visual quality and alignment. While significant progress has been made in developing objective models to assess these dimensions, the performance of such models relies heavily on the scale and quality of human annotations. According to scaling laws, increasing the number of human-labeled instances improves the performance of evaluation models in a predictable way. Therefore, we introduce a comprehensive dataset designed to Evaluate Visual quality and Alignment Level for text-to-vision content (Q-EVAL-100K), featuring the largest collection of human-labeled Mean Opinion Scores (MOS) for these two aspects. The Q-EVAL-100K dataset encompasses both text-to-image and text-to-video models, with 960K human annotations specifically focused on visual quality and alignment for 100K instances (60K images and 40K videos). Leveraging this dataset with a context prompt, we propose Q-Eval-Score, a unified model capable of evaluating both visual quality and alignment, with special improvements for handling long-text prompt alignment. Experimental results indicate that the proposed Q-Eval-Score achieves superior performance on both visual quality and alignment, with strong generalization capabilities across other benchmarks. These findings highlight the significant value of the Q-EVAL-100K dataset. Data and code will be available at https://github.com/zzc-1998/Q-Eval.
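
To make the annotation figures concrete: 960K annotations over 100K instances averages to about 9.6 annotations per instance across the two dimensions. The sketch below shows one common way such raw ratings could be turned into a Mean Opinion Score, namely a plain per-instance, per-dimension average. The abstract does not describe the dataset's actual aggregation procedure, so the function name `mean_opinion_scores`, the tuple layout, the instance ID, and the 1-5 rating scale are all assumptions for illustration only.

```python
from collections import defaultdict
from statistics import mean


def mean_opinion_scores(ratings):
    """Average raw ratings per (instance_id, dimension) pair.

    `ratings` is an iterable of (instance_id, dimension, score) tuples.
    This layout is hypothetical; it is not the Q-Eval-100K file format.
    """
    buckets = defaultdict(list)
    for instance_id, dimension, score in ratings:
        buckets[(instance_id, dimension)].append(score)
    # A plain mean is assumed here; real pipelines often also filter
    # outlier raters or normalize scores per rater before averaging.
    return {key: mean(scores) for key, scores in buckets.items()}


# Hypothetical ratings on an assumed 1-5 scale.
ratings = [
    ("img_0001", "quality", 4),
    ("img_0001", "quality", 5),
    ("img_0001", "quality", 3),
    ("img_0001", "alignment", 2),
    ("img_0001", "alignment", 3),
]
mos = mean_opinion_scores(ratings)
```

Keeping quality and alignment as separate keys mirrors the two-dimensional labeling described above: each instance ends up with one MOS per dimension rather than a single blended score.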

Problem

Research questions and friction points this paper is trying to address.

Evaluating visual quality and alignment in text-to-vision content.

Developing a large-scale human-annotated dataset for model evaluation.

Proposing a unified model to assess quality and alignment effectively.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Largest human-labeled dataset for text-to-vision evaluation

Unified model for visual quality and alignment assessment

Dedicated improvements for handling long-text prompt alignment

🔎 Similar Papers

Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings