🤖 AI Summary
This work addresses the challenge of unified modeling for discrete emotion recognition and continuous affective analysis in multimodal sentiment understanding by proposing an Expert-Guided Multimodal Fusion (EGMF) framework. EGMF employs three specialized expert networks to capture local details, cross-modal semantic associations, and global contextual information, respectively. A hierarchical dynamic gating mechanism enables adaptive feature fusion, while pseudo-token injection and prompt conditioning integrate the enhanced representations into a large language model (LLM). This allows both classification and regression tasks to be handled generatively within a single architecture. As the first approach to combine dynamic multi-expert fusion with LLMs, EGMF achieves state-of-the-art performance across bilingual benchmarks, including MELD, CHERMA, MOSEI, and SIMS-V2, demonstrating strong cross-lingual robustness and generalizable affective representations.
📝 Abstract
Multimodal emotion understanding requires effective integration of text, audio, and visual modalities for both discrete emotion recognition and continuous sentiment analysis. We present EGMF, a unified framework combining expert-guided multimodal fusion with large language models. Our approach features three specialized expert networks: a fine-grained local expert for subtle emotional nuances, a semantic correlation expert for cross-modal relationships, and a global context expert for long-range dependencies. These are adaptively integrated through hierarchical dynamic gating for context-aware feature selection. The enhanced multimodal representations are integrated with LLMs via pseudo-token injection and prompt-based conditioning, enabling a single generative framework to handle both classification and regression through natural language generation. We employ LoRA fine-tuning for computational efficiency. Experiments on bilingual benchmarks (MELD, CHERMA, MOSEI, SIMS-V2) demonstrate consistent improvements over state-of-the-art methods, with superior cross-lingual robustness revealing universal patterns in multimodal emotional expression across English and Chinese. We will release the source code publicly.
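To make the fusion-and-injection pipeline described above concrete, here is a minimal numpy sketch of the two core ideas: three expert transforms combined by a softmax gate, and a projection of the fused features into a fixed number of pseudo-token embeddings for an LLM. All names, dimensions, and the use of plain linear maps as stand-ins for the expert networks and gate are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # fused multimodal feature dimension (illustrative)
n_tokens = 4   # number of pseudo tokens injected into the LLM (illustrative)
d_llm = 16     # LLM embedding dimension (illustrative)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical stand-ins for the three expert networks (local, semantic,
# global): each maps the multimodal feature vector to a d-dim representation.
W_local, W_sem, W_glob = (rng.standard_normal((d, d)) for _ in range(3))
w_gate = rng.standard_normal((3, d))  # one gating score per expert

def expert_guided_fusion(x):
    """Adaptively weight the three expert outputs via a softmax gate."""
    experts = np.stack([W_local @ x, W_sem @ x, W_glob @ x])  # (3, d)
    gate = softmax(w_gate @ x)                                # (3,), sums to 1
    return gate @ experts                                     # (d,)

# Pseudo-token injection: project fused features to n_tokens LLM embeddings,
# which would be prepended to the prompt's token embeddings.
W_proj = rng.standard_normal((n_tokens * d_llm, d))

def pseudo_tokens(fused):
    return (W_proj @ fused).reshape(n_tokens, d_llm)

x = rng.standard_normal(d)            # placeholder multimodal features
fused = expert_guided_fusion(x)
tokens = pseudo_tokens(fused)
print(tokens.shape)  # (4, 16)
```

In the full model the gate would be hierarchical and context-dependent rather than a single linear scorer, and the pseudo tokens would condition a LoRA-tuned LLM that emits labels or sentiment scores as text; this sketch only shows the data flow.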