Emotion-Qwen: Training Hybrid Experts for Unified Emotion and General Vision-Language Understanding

📅 2025-05-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two key bottlenecks in video emotion understanding with large multimodal models (LMMs): weak fine-grained emotional modeling and catastrophic forgetting during fine-tuning. To this end, we propose Emotion-Qwen, a unified vision-language framework. Methodologically, Emotion-Qwen introduces: (1) the first emotion-general dual-path Hybrid Mixture-of-Experts (MoE) compressor, jointly optimizing task specificity and generalization; (2) a three-stage collaborative pretraining paradigm; and (3) VER, the first large-scale bilingual video emotion reasoning dataset (40K+ samples). Built upon the Qwen architecture, Emotion-Qwen integrates cross-modal feature alignment with joint modeling of multi-source inputs (images, videos, text, and audio). Empirically, Emotion-Qwen achieves state-of-the-art performance across multiple emotion recognition benchmarks while retaining strong generalization on mainstream vision-language tasks. Code and models are publicly released.
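To make the dual-path routing concrete, here is a minimal PyTorch sketch of the idea as described in the summary: a learned gate softly mixes an emotion-specific expert and a general-purpose expert for each visual token. All names (`HybridCompressor`, `emotion_expert`, `router`), the dimensions, and the soft two-way gate are illustrative assumptions rather than the authors' released implementation, and the token-count reduction a "compressor" would normally also perform is omitted for brevity.

```python
import torch
import torch.nn as nn

class HybridCompressor(nn.Module):
    """Illustrative dual-path MoE block (an assumption; not the official code).

    A learned router produces per-token weights over two expert MLPs, so the
    emotion path can specialize while the general path preserves broad
    vision-language behavior.
    """

    def __init__(self, dim: int = 1024, hidden: int = 4096):
        super().__init__()

        def make_expert() -> nn.Sequential:
            return nn.Sequential(
                nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
            )

        self.emotion_expert = make_expert()  # emotion-specific path
        self.general_expert = make_expert()  # general-purpose path
        self.router = nn.Linear(dim, 2)      # per-token gate over the two paths

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, dim) visual features from the vision encoder
        gate = torch.softmax(self.router(tokens), dim=-1)  # (B, S, 2)
        return (gate[..., 0:1] * self.emotion_expert(tokens)
                + gate[..., 1:2] * self.general_expert(tokens))

if __name__ == "__main__":
    x = torch.randn(2, 16, 1024)        # dummy batch of visual tokens
    print(HybridCompressor()(x).shape)  # torch.Size([2, 16, 1024])
```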

📝 Abstract
Emotion understanding in videos aims to accurately recognize and interpret individuals' emotional states by integrating contextual, visual, textual, and auditory cues. While Large Multimodal Models (LMMs) have demonstrated significant progress in general vision-language (VL) tasks, their performance in emotion-specific scenarios remains limited. Moreover, fine-tuning LMMs on emotion-related tasks often leads to catastrophic forgetting, hindering their ability to generalize across diverse tasks. To address these challenges, we present Emotion-Qwen, a tailored multimodal framework designed to enhance both emotion understanding and general VL reasoning. Emotion-Qwen incorporates a sophisticated Hybrid Compressor based on the Mixture of Experts (MoE) paradigm, which dynamically routes inputs to balance emotion-specific and general-purpose processing. The model is pre-trained in a three-stage pipeline on large-scale general and emotional image datasets to support robust multimodal representations. Furthermore, we construct the Video Emotion Reasoning (VER) dataset, comprising more than 40K bilingual video clips with fine-grained descriptive annotations, to further enrich Emotion-Qwen's emotional reasoning capability. Experimental results demonstrate that Emotion-Qwen achieves state-of-the-art performance on multiple emotion recognition benchmarks, while maintaining competitive results on general VL tasks. Code and models are available at https://anonymous.4open.science/r/Emotion-Qwen-Anonymous.
Problem

Research questions and friction points this paper is trying to address.

Enhancing emotion understanding in videos using multimodal cues
Preventing catastrophic forgetting in emotion-tuned vision-language models
Balancing emotion-specific and general-purpose multimodal processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Compressor using Mixture of Experts
Three-stage pre-training on multimodal datasets (see the sketch after this list)
Video Emotion Reasoning dataset with 40K clips
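The page does not spell out what each of the three pre-training stages trains on or leaves unfrozen; the sketch below is only a plausible schematic of such a staged schedule. The stage names, dataset handles such as `general_image_text`, and the freeze lists are hypothetical, beyond the abstract's statement that pre-training uses general and emotional image data and that VER enriches emotional reasoning.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    """One phase of a staged LMM training schedule (schematic)."""
    name: str
    datasets: list[str]   # data mixture for this stage (handles are hypothetical)
    trainable: list[str]  # modules left unfrozen in this stage (assumed)

PIPELINE = [
    # 1) Align visual features with the LLM: train only the compressor.
    Stage("alignment", ["general_image_text"], ["hybrid_compressor"]),
    # 2) Joint pre-training on general plus emotional image data.
    Stage("joint_pretraining",
          ["general_image_text", "emotional_images"],
          ["hybrid_compressor", "llm"]),
    # 3) Emotion reasoning tuning on the bilingual VER video dataset.
    Stage("emotion_tuning", ["VER"], ["hybrid_compressor", "llm"]),
]

for stage in PIPELINE:
    print(f"{stage.name}: data={stage.datasets}, unfrozen={stage.trainable}")
```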
Authors
Dawei Huang
Shenzhen Technology University, Shenzhen, China
Qing Li
Shenzhen Technology University, Shenzhen, China
Chuan Yan
Stanford University, Stanford, USA
Zebang Cheng
Shenzhen University
AI, CV, MLLM, Affective Computing
Yurong Huang
University of Electronic Science and Technology of China, Chengdu, China
Xiang Li
Shenzhen Technology University, Shenzhen, China
Bin Li
Skyworth Digital, Shenzhen, China
Xiaohui Wang
Shenzhen Xiaopai Tech Co., Shenzhen, China
Zheng Lian
Associate Professor, IEEE/CCF Senior Member, Institute of Automation, Chinese Academy of Sciences
Affective Computing, Sentiment Analysis, Machine Learning
Xiaojiang Peng
Shenzhen Technology University
Computer Vision, Facial Expression Recognition, Multimodal Emotion Recognition