A Unified Framework for Emotion Recognition and Sentiment Analysis via Expert-Guided Multimodal Fusion with Large Language Models

📅 2026-01-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of unified modeling for discrete emotion recognition and continuous sentiment analysis in multimodal affect understanding by proposing an Expert-Guided Multimodal Fusion (EGMF) framework. EGMF employs three specialized expert networks to capture local details, cross-modal semantic associations, and global contextual information. A hierarchical dynamic gating mechanism enables adaptive feature fusion, while pseudo-token injection and prompt conditioning integrate the enhanced representations into a large language model (LLM), allowing both classification and regression to be handled generatively within a single architecture. As the first approach to combine dynamic multi-expert fusion with LLMs, EGMF achieves state-of-the-art performance on bilingual benchmarks (MELD, CHERMA, MOSEI, and SIMS-V2), demonstrating strong cross-lingual robustness and generalizable affective representations.
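For intuition, here is a minimal PyTorch sketch of the expert-and-gate fusion pattern the summary describes. The module names, dimensions, and expert bodies are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class GatedExpertFusion(nn.Module):
    """Minimal sketch: several experts whose outputs are mixed by a
    softmax gate conditioned on the fused multimodal input (assumed design)."""
    def __init__(self, dim: int = 768, num_experts: int = 3):
        super().__init__()
        # Stand-ins for the local-detail, semantic-correlation,
        # and global-context experts described in the paper.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(dim, num_experts)  # context-aware gating scores

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) fused text/audio/visual features
        weights = torch.softmax(self.gate(x), dim=-1)            # (batch, E)
        outputs = torch.stack([e(x) for e in self.experts], 1)   # (batch, E, dim)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)      # (batch, dim)
```

The hierarchical variant in the paper presumably stacks or nests such gates; this single-level version only shows the adaptive weighting idea.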

📝 Abstract
Multimodal emotion understanding requires effective integration of text, audio, and visual modalities for both discrete emotion recognition and continuous sentiment analysis. We present EGMF, a unified framework combining expert-guided multimodal fusion with large language models. Our approach features three specialized expert networks (a fine-grained local expert for subtle emotional nuances, a semantic correlation expert for cross-modal relationships, and a global context expert for long-range dependencies), adaptively integrated through hierarchical dynamic gating for context-aware feature selection. Enhanced multimodal representations are integrated with LLMs via pseudo-token injection and prompt-based conditioning, enabling a single generative framework to handle both classification and regression through natural language generation. We employ LoRA fine-tuning for computational efficiency. Experiments on bilingual benchmarks (MELD, CHERMA, MOSEI, SIMS-V2) demonstrate consistent improvements over state-of-the-art methods, with superior cross-lingual robustness revealing universal patterns in multimodal emotional expressions across English and Chinese. We will release the source code publicly.
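As a companion sketch, the snippet below shows one plausible way pseudo-token injection and prompt conditioning could feed the fused features into a (LoRA-tuned) LLM. The projection design, `fusion_dim`, `llm_dim`, and `num_pseudo_tokens` are assumed values, not the paper's exact module.

```python
import torch
import torch.nn as nn

class PseudoTokenInjector(nn.Module):
    """Sketch of pseudo-token injection: project fused multimodal features
    into the LLM embedding space and prepend them to the embedded text
    prompt. Names and shapes are illustrative assumptions."""
    def __init__(self, fusion_dim: int = 768, llm_dim: int = 4096,
                 num_pseudo_tokens: int = 4):
        super().__init__()
        self.num_pseudo_tokens = num_pseudo_tokens
        self.proj = nn.Linear(fusion_dim, llm_dim * num_pseudo_tokens)

    def forward(self, fused: torch.Tensor,
                prompt_embeds: torch.Tensor) -> torch.Tensor:
        # fused: (batch, fusion_dim); prompt_embeds: (batch, seq, llm_dim)
        b = fused.size(0)
        pseudo = self.proj(fused).view(b, self.num_pseudo_tokens, -1)
        # The LoRA-tuned LLM then decodes the emotion label or sentiment
        # score as text, covering classification and regression alike.
        return torch.cat([pseudo, prompt_embeds], dim=1)
```

Because the output of both tasks is generated as natural language, the same decoder handles a discrete label ("anger") and a continuous score ("0.6") without task-specific heads.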
Problem

Research questions and friction points this paper is trying to address.

emotion recognition
sentiment analysis
multimodal fusion
large language models
cross-lingual robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

expert-guided multimodal fusion
large language models
hierarchical dynamic gating
pseudo token injection
cross-lingual emotion recognition
Authors

Jiaqi Qiao
School of Software, Dalian University of Technology, China
Xiujuan Xu
School of Software, Dalian University of Technology, China
Xinran Li
Dalian University of Technology
NLP, LLMs, ERC
Yu Liu
Dalian University of Technology
computer vision, multimodal learning