OmniMER: Indonesian Multimodal Emotion Recognition via Auxiliary-Enhanced LLM Adaptation

📅 2025-12-22
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the lack of benchmark datasets and robust models for Indonesian multimodal emotion recognition, this paper introduces IndoMER, the first Indonesian video-based emotion benchmark (1,944 videos, 7 emotion classes), and proposes OmniMER, a framework built upon Qwen2.5-Omni that jointly models three auxiliary tasks: text keyword extraction, facial expression decoding, and prosodic analysis, enabling temporally aligned cross-modal fusion. This design mitigates cross-modal inconsistency and the culture-driven long-tail distribution challenge. On IndoMER, OmniMER achieves Macro-F1 scores of 0.582 (coarse-grained sentiment) and 0.454 (fine-grained emotion), outperforming the base model by 7.6 and 22.1 percentage points, respectively; it also demonstrates strong cross-lingual transfer on the Chinese CH-SIMS dataset. Key contributions include: (1) the first multimodal emotion benchmark for Indonesian, a low-resource language; (2) an auxiliary-task-driven paradigm for adapting large language models to multimodal emotion recognition; and (3) a culturally adaptive, robust cross-modal fusion mechanism.

๐Ÿ“ Abstract
Indonesian, spoken by over 200 million people, remains underserved in multimodal emotion recognition research despite its dominant presence on Southeast Asian social media platforms. We introduce IndoMER, the first multimodal emotion recognition benchmark for Indonesian, comprising 1,944 video segments from 203 speakers with temporally aligned text, audio, and visual annotations across seven emotion categories. The dataset exhibits realistic challenges including cross-modal inconsistency and long-tailed class distributions shaped by Indonesian cultural communication norms. To address these challenges, we propose OmniMER, a multimodal adaptation framework built upon Qwen2.5-Omni that enhances emotion recognition through three auxiliary modality-specific perception tasks: emotion keyword extraction for text, facial expression analysis for video, and prosody analysis for audio. These auxiliary tasks help the model identify emotion-relevant cues in each modality before fusion, reducing reliance on spurious correlations in low-resource settings. Experiments on IndoMER show that OmniMER achieves 0.582 Macro-F1 on sentiment classification and 0.454 on emotion recognition, outperforming the base model by 7.6 and 22.1 absolute points respectively. Cross-lingual evaluation on the Chinese CH-SIMS dataset further demonstrates the generalizability of the proposed framework. The dataset and code are publicly available at https://github.com/yanxm01/INDOMER.
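The Macro-F1 numbers reported above average per-class F1 with equal weight, which is why the metric is a natural choice under the long-tailed class distribution the dataset exhibits: rare emotion classes count as much as frequent ones. A minimal pure-Python sketch of the metric (illustrative only, not the paper's evaluation code):

```python
def macro_f1(y_true, y_pred, labels):
    """Macro-F1: the unweighted mean of per-class F1 scores, so a rare
    (long-tail) emotion class influences the score as much as a common one."""
    f1_scores = []
    for c in labels:
        # Per-class counts: true positives, false positives, false negatives.
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

A model that always predicts the majority class scores high on accuracy but near zero on Macro-F1, which is exactly the failure mode a long-tailed benchmark needs to expose.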
Problem

Research questions and friction points this paper is trying to address.

Develops a multimodal emotion recognition benchmark for Indonesian language
Addresses cross-modal inconsistency and long-tailed class distribution challenges
Proposes an auxiliary-enhanced LLM adaptation framework for emotion recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal adaptation framework using Qwen2.5-Omni
Auxiliary modality-specific perception tasks for emotion cues
Enhanced emotion recognition via cross-modal fusion
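The auxiliary-task idea above (first extract modality-specific emotion cues, then fuse them for the final decision) can be sketched as a prompt-assembly step. The function name, field labels, and template below are illustrative assumptions, not the released OmniMER code:

```python
def build_emotion_prompt(transcript, text_keywords, facial_cues, prosody_cues):
    """Hypothetical fusion prompt: outputs of the three auxiliary perception
    tasks (text keywords, facial expression, prosody) are made explicit
    before the model is asked to classify the emotion. All wording here is
    an assumption for illustration, not OmniMER's actual template."""
    return (
        f"Transcript: {transcript}\n"
        f"Emotion keywords (text): {', '.join(text_keywords)}\n"
        f"Facial expression cues (video): {facial_cues}\n"
        f"Prosody cues (audio): {prosody_cues}\n"
        "Based on all three modalities, classify the speaker's emotion "
        "into one of the seven categories."
    )
```

Surfacing per-modality cues as explicit intermediate outputs gives the model a chance to notice cross-modal inconsistency (e.g. positive words over a flat, low-pitch delivery) instead of latching onto a single spurious signal.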
Xueming Yan
School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, 510006
Boyan Xu
Guangdong University of Technology
Yaochu Jin
School of Engineering, Westlake University, Hangzhou 310030, China
Lixian Xiao
Faculty of Asian Languages and Cultures, Guangdong University of Foreign Studies, Guangzhou, China
Wenlong Ye
School of Computer Science, Guangdong University of Technology, Guangzhou, 510006
Runyang Cai
School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, 510006
Zeqi Zheng
School of Engineering, Westlake University, Hangzhou 310030, China
Jingfa Liu
School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, 510006
Aimin Yang
School of Computer Science and Intelligence Education, Lingnan Normal University, Zhanjiang, 524000, China