🤖 AI Summary
This work addresses the challenge of unified modeling for discrete emotion recognition and continuous affective analysis in multimodal sentiment understanding by proposing an Expert-Guided Multimodal Fusion (EGMF) framework. EGMF employs three specialized expert networks to capture local details, cross-modal semantic associations, and global contextual information, respectively. A hierarchical dynamic gating mechanism enables adaptive feature fusion, while pseudo-token injection and prompt conditioning integrate the enhanced representations into a large language model (LLM). This allows both classification and regression tasks to be handled generatively within a single architecture. As the first approach to combine dynamic multi-expert fusion with LLMs, EGMF achieves state-of-the-art performance across bilingual benchmarks, including MELD, CHERMA, MOSEI, and SIMS-V2, demonstrating strong cross-lingual robustness and generalizable affective representations.
📝 Abstract
Multimodal emotion understanding requires effective integration of text, audio, and visual modalities for both discrete emotion recognition and continuous sentiment analysis. We present EGMF, a unified framework combining expert-guided multimodal fusion with large language models. Our approach features three specialized expert networks: a fine-grained local expert for subtle emotional nuances, a semantic correlation expert for cross-modal relationships, and a global context expert for long-range dependencies. These are adaptively integrated through hierarchical dynamic gating for context-aware feature selection. The enhanced multimodal representations are integrated with LLMs via pseudo-token injection and prompt-based conditioning, enabling a single generative framework to handle both classification and regression through natural language generation. We employ LoRA fine-tuning for computational efficiency. Experiments on bilingual benchmarks (MELD, CHERMA, MOSEI, SIMS-V2) demonstrate consistent improvements over state-of-the-art methods, with superior cross-lingual robustness revealing universal patterns in multimodal emotional expression across English and Chinese. We will release the source code publicly.
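To make the fusion-and-injection pipeline described above concrete, here is a minimal numpy sketch of the two core ideas: three expert transforms combined by a softmax gate, and a projection of the fused features into a fixed number of pseudo-token embeddings for an LLM. All names, dimensions, and the use of plain linear maps as stand-ins for the expert networks and gate are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # fused multimodal feature dimension (illustrative)
n_tokens = 4   # number of pseudo tokens injected into the LLM (illustrative)
d_llm = 16     # LLM embedding dimension (illustrative)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical stand-ins for the three expert networks (local, semantic,
# global): each maps the multimodal feature vector to a d-dim representation.
W_local, W_sem, W_glob = (rng.standard_normal((d, d)) for _ in range(3))
w_gate = rng.standard_normal((3, d))  # one gating score per expert

def expert_guided_fusion(x):
    """Adaptively weight the three expert outputs via a softmax gate."""
    experts = np.stack([W_local @ x, W_sem @ x, W_glob @ x])  # (3, d)
    gate = softmax(w_gate @ x)                                # (3,), sums to 1
    return gate @ experts                                     # (d,)

# Pseudo-token injection: project fused features to n_tokens LLM embeddings,
# which would be prepended to the prompt's token embeddings.
W_proj = rng.standard_normal((n_tokens * d_llm, d))

def pseudo_tokens(fused):
    return (W_proj @ fused).reshape(n_tokens, d_llm)

x = rng.standard_normal(d)            # placeholder multimodal features
fused = expert_guided_fusion(x)
tokens = pseudo_tokens(fused)
print(tokens.shape)  # (4, 16)
```

In the full model the gate would be hierarchical and context-dependent rather than a single linear scorer, and the pseudo tokens would condition a LoRA-tuned LLM that emits labels or sentiment scores as text; this sketch only shows the data flow.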