🤖 AI Summary
Problem: Arabic multimodal machine learning (MML) lacks a systematic survey and a structured taxonomy, leaving its research gaps, critical bottlenecks, and future directions unclear. Method: We conduct a comprehensive literature review and a technical analysis across text, audio, and visual modalities, and propose the first four-dimensional taxonomy for Arabic MML, covering datasets, application scenarios, modeling approaches, and core challenges. Contribution/Results: Our taxonomy reveals underexplored directions, including cross-modal alignment, low-resource robustness, and culturally adaptive modeling, and identifies key bottlenecks: data scarcity, inconsistent annotation practices, and the absence of standardized evaluation protocols. The framework delivers a structured knowledge graph and a reproducible research roadmap for Arabic MML, improving the field’s conceptual clarity, methodological rigor, and scalability. This work lays a foundation for principled, culturally grounded advances in Arabic multimodal AI.
📝 Abstract
Multimodal Machine Learning (MML) aims to integrate and analyze information from diverse modalities, such as text, audio, and visuals, enabling machines to address complex tasks like sentiment analysis, emotion recognition, and multimedia retrieval. Arabic MML has recently matured enough in its foundational development to warrant a comprehensive survey. This paper explores Arabic MML by categorizing existing research through a novel taxonomy that organizes efforts into four key topics: datasets, applications, approaches, and challenges. By providing a structured overview, this survey offers insights into the current state of Arabic MML, highlighting uninvestigated areas and critical research gaps. Researchers can build upon the identified opportunities and address the challenges to advance the field.