🤖 AI Summary
This study systematically evaluates demographic and linguistic bias in omnimodal language models across text, image, audio, and video tasks. By constructing a cross-modal benchmark and combining fairness metrics with subgroup analysis along multiple dimensions, the work reveals for the first time how bias is distributed across modalities. Image and video tasks generally show strong performance with comparatively small demographic disparities, whereas audio tasks suffer from markedly lower performance, substantial bias along age, gender, and language, and frequent prediction collapse toward narrow categories. The research establishes a comprehensive framework and provides empirical evidence for assessing fairness in omnimodal models.
📝 Abstract
This paper provides a comprehensive evaluation of demographic and linguistic biases in omnimodal language models that process text, images, audio, and video within a single framework. Although these models are being widely deployed, their performance across demographic groups and modalities remains understudied. Four omnimodal models are evaluated on tasks that include demographic attribute estimation, identity verification, activity recognition, multilingual speech transcription, and language identification. Accuracy differences are measured across age, gender, skin tone, language, and country of origin. The results show that image and video understanding tasks generally achieve higher performance with smaller demographic disparities. In contrast, audio understanding tasks exhibit significantly lower performance and substantial bias, including large accuracy differences across age groups, genders, and languages, and frequent prediction collapse toward narrow categories. These findings highlight the importance of evaluating fairness across all supported modalities as omnimodal language models are increasingly used in real-world applications.
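The abstract does not spell out how accuracy differences or prediction collapse are computed, so the following is only a minimal sketch under stated assumptions: disparity is taken as the gap between the best and worst subgroup accuracy, and collapse is approximated by the share of predictions that fall on the single most frequent class. The function names and the toy records are hypothetical and are not drawn from the paper.

```python
from collections import Counter, defaultdict


def group_accuracies(records):
    """Return accuracy per subgroup from (group, label, prediction) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, label, prediction in records:
        total[group] += 1
        correct[group] += int(label == prediction)
    return {group: correct[group] / total[group] for group in total}


def accuracy_gap(records):
    """Best minus worst subgroup accuracy (an assumed disparity measure)."""
    accs = group_accuracies(records)
    return max(accs.values()) - min(accs.values())


def top_prediction_share(records):
    """Fraction of all predictions taken by the most frequent predicted class.

    Values near 1.0 suggest the model is collapsing onto one category.
    """
    counts = Counter(prediction for _, _, prediction in records)
    return counts.most_common(1)[0][1] / len(records)


# Made-up illustration data: (age group, true label, predicted label).
toy = [
    ("18-30", "a", "a"),
    ("18-30", "b", "b"),
    ("60+", "a", "b"),
    ("60+", "b", "b"),
]

print(group_accuracies(toy))
print(accuracy_gap(toy))
print(top_prediction_share(toy))
```

Reporting both numbers side by side can be useful because a model that collapses onto one class may still score well on a subgroup where that class dominates; the sketch assumes a non-empty list of records.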