🤖 AI Summary
Large multimodal models (LMMs) exhibit limited capability in open-domain, multi-image visual quality comparison and fine-grained reasoning. Method: We introduce Multi-Quality-Bench, the first hierarchical visual quality assessment benchmark tailored for LMMs, comprising thousands of coarse-to-fine evaluation samples over single images, image pairs, and multi-image groups. Models are assessed under a human-perception-aligned, interpretable framework that combines two-alternative forced-choice (2AFC) binary preference judgments with multiple-choice questions (MCQs). Contribution/Results: The associated international challenge attracted around 100 participants; the five top-performing entries demonstrate the efficacy of instruction tuning for visual quality assessment. Multi-Quality-Bench establishes a standardized, reproducible foundation for rigorous, large-scale evaluation and advances systematic research in LMM-based visual quality understanding.
📝 Abstract
This paper presents a summary of the VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models (LMMs), hosted as part of the ICCV 2025 Workshop on Visual Quality Assessment. The challenge aims to evaluate and enhance the ability of state-of-the-art LMMs to perform open-ended, detailed reasoning about visual quality differences across multiple images. To this end, the competition introduces a novel benchmark comprising thousands of coarse-to-fine-grained visual quality comparison tasks spanning single images, image pairs, and multi-image groups, each requiring models to provide accurate quality judgments. The competition emphasizes holistic evaluation protocols, including two-alternative forced-choice (2AFC) binary preference judgments and multiple-choice questions (MCQs). Around 100 participants submitted entries, and the five top-performing models demonstrate the emerging capabilities of instruction-tuned LMMs in quality assessment. This challenge marks a significant step toward open-domain visual quality reasoning and comparison, and it serves as a catalyst for future research on interpretable, human-aligned quality evaluation systems.
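To make the two evaluation formats concrete, the sketch below scores model outputs under both paradigms: 2AFC items are graded as binary preference matches and MCQ items as exact option matches. This is a minimal illustration only; the field names, answer encoding, and toy data are assumptions for exposition, not the challenge's actual submission schema or official scoring code.

```python
# Minimal sketch of scoring the two evaluation paradigms described above.
# The Item schema and the toy data below are assumptions, not the
# challenge's real format.
from dataclasses import dataclass

@dataclass
class Item:
    task_type: str   # "2afc" (binary preference) or "mcq" (multiple choice)
    prediction: str  # model output, e.g. "A"/"B" for 2AFC or an option letter
    answer: str      # ground-truth label

def accuracy(items: list[Item], task_type: str) -> float:
    """Fraction of items of the given type answered correctly."""
    subset = [it for it in items if it.task_type == task_type]
    if not subset:
        return 0.0
    correct = sum(
        it.prediction.strip().upper() == it.answer.strip().upper()
        for it in subset
    )
    return correct / len(subset)

# Toy example: two 2AFC image pairs and one four-option MCQ.
items = [
    Item("2afc", "A", "A"),
    Item("2afc", "B", "A"),
    Item("mcq",  "C", "C"),
]
print(f"2AFC accuracy: {accuracy(items, '2afc'):.2f}")  # 0.50
print(f"MCQ accuracy:  {accuracy(items, 'mcq'):.2f}")   # 1.00
```

An overall leaderboard score would then aggregate the per-format accuracies, for example by averaging them; the actual weighting used by the challenge is not specified here.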