🤖 AI Summary
Existing audiovisual quality assessment datasets are limited in scale, lack diversity in content and quality degradation types, and provide only holistic scores, hindering research on multimodal perception mechanisms. To address these limitations, this work proposes a crowdsourced subjective evaluation framework that moves beyond traditional laboratory constraints, combined with a systematic data sampling strategy and a multidimensional annotation scheme. The result is YT-NTU-AVQ, the largest and most diverse audiovisual quality assessment dataset to date, comprising 1,620 user-generated videos spanning a broad range of semantic scenarios and quality levels. The dataset and platform code are publicly released to support research on multimodal perceptual modeling.
📝 Abstract
Audio-visual quality assessment (AVQA) research has been stalled by the limitations of existing datasets: they are typically small in scale, offer insufficient diversity in content and quality, and are annotated only with overall scores. These shortcomings provide limited support for model development and multimodal perception research. We propose a practical approach to AVQA dataset construction. First, we design a crowdsourced subjective experiment framework for AVQA that breaks the constraints of in-lab settings and achieves reliable annotation across varied environments. Second, we employ a systematic data preparation strategy to ensure broad coverage of both quality levels and semantic scenarios. Third, we extend the dataset with additional annotations, enabling research on multimodal perception mechanisms and their relation to content. Finally, we validate this approach through YT-NTU-AVQ, the largest and most diverse AVQA dataset to date, consisting of 1,620 user-generated audio and video (A/V) sequences. The dataset and platform code are available at https://github.com/renyu12/YT-NTU-AVQ.
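As a rough illustration of the data preparation strategy described above, one could stratify candidate clips over a grid of semantic category and estimated quality, then sample evenly from each cell. The sketch below is a minimal example of this idea, assuming each candidate clip already carries a semantic cluster label and a rough quality estimate in [0, 1]; the function and field names are hypothetical and not taken from the released codebase.

```python
import random
from collections import defaultdict

def stratified_sample(clips, n_quality_bins=5, per_cell=4, seed=0):
    """Sample clips evenly across a (semantic cluster x quality bin) grid.

    clips: list of dicts with keys 'id', 'cluster' (semantic label),
           and 'quality' (a rough quality estimate in [0, 1]).
    Returns a subset of clips with broad coverage of both axes.
    """
    rng = random.Random(seed)

    # Group candidates into grid cells, binning the quality estimate
    # so that low-, mid-, and high-quality content are all represented.
    cells = defaultdict(list)
    for clip in clips:
        q_bin = min(int(clip["quality"] * n_quality_bins), n_quality_bins - 1)
        cells[(clip["cluster"], q_bin)].append(clip)

    # Draw up to per_cell clips from every non-empty cell.
    selected = []
    for cell_clips in cells.values():
        rng.shuffle(cell_clips)
        selected.extend(cell_clips[:per_cell])
    return selected
```

Sampling a fixed budget per cell, rather than uniformly over the whole pool, keeps rare (cluster, quality) combinations represented, which is what broad coverage of quality levels and semantic scenarios requires.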