🤖 AI Summary
Current multimedia quality assessment models rely on Mean Opinion Scores (MOS) as their sole supervisory signal, neglecting semantic defects, user intent, and the rationale behind quality judgments, which results in poor interpretability and contextual adaptability. This position paper argues for a shift beyond scalar supervision, toward a framework that integrates context-aware modeling, evidence-grounded reasoning, and cross-modal semantic alignment. Methodologically, the proposal leverages Vision-Language Models (VLMs) to build joint perceptual-semantic representations, incorporates contextual metadata, and calls for expert-style rationale generation for quality decisions. The accompanying roadmap advocates richer datasets with contextual annotations and new evaluation metrics that measure semantic alignment fidelity, reasoning faithfulness, and contextual sensitivity, moving quality assessment from opaque scalar prediction toward a trustworthy, robust, and human-aligned paradigm.
📝 Abstract
This position paper argues that Mean Opinion Score (MOS), while historically foundational, is no longer sufficient as the sole supervisory signal for multimedia quality assessment models. MOS reduces rich, context-sensitive human judgments to a single scalar, obscuring semantic failures, user intent, and the rationale behind quality decisions. We contend that modern quality assessment models must integrate three interdependent capabilities: (1) context-awareness, to adapt evaluations to task-specific goals and viewing conditions; (2) reasoning, to produce interpretable, evidence-grounded justifications for quality judgments; and (3) multimodality, to align perceptual and semantic cues using vision-language models. We critique the limitations of current MOS-centric benchmarks and propose a roadmap for reform: richer datasets with contextual metadata and expert rationales, and new evaluation metrics that assess semantic alignment, reasoning fidelity, and contextual sensitivity. By reframing quality assessment as a contextual, explainable, and multimodal modeling task, we aim to catalyze a shift toward more robust, human-aligned, and trustworthy evaluation systems.
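The reduction the abstract criticizes is easy to make concrete: under ITU-T P.800, MOS is simply the arithmetic mean of ordinal opinion ratings (1 = bad through 5 = excellent), so very different judgment profiles can collapse to the same scalar. A minimal sketch (function name and rating values are illustrative, not from the paper):

```python
def mean_opinion_score(ratings):
    """Collapse a list of 1-5 opinion ratings into a single scalar (ITU-T P.800)."""
    return sum(ratings) / len(ratings)

# Two very different judgment profiles...
consensus = [3, 3, 3, 3, 3]   # raters agree: uniformly mediocre quality
polarized = [1, 5, 1, 5, 3]   # raters split: severe defects for some, fine for others

# ...yield the identical score, discarding disagreement, intent, and rationale.
print(mean_opinion_score(consensus))  # 3.0
print(mean_opinion_score(polarized))  # 3.0
```

The polarized case is exactly the information loss at issue: the scalar preserves no trace of *why* raters disagreed, which is the signal a context-aware, reasoning-capable model would need.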