Emotion-Qwen: Training Hybrid Experts for Unified Emotion and General Vision-Language Understanding

📅 2025-05-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two key bottlenecks in video emotion understanding with large multimodal models (LMMs): weak fine-grained emotional modeling and catastrophic forgetting during fine-tuning. To this end, we propose Emotion-Qwen, a unified vision-language framework. Methodologically, Emotion-Qwen introduces: (1) the first emotion-general dual-path Hybrid Mixture-of-Experts (MoE) compressor, jointly optimizing task specificity and generalization; (2) a three-stage collaborative pretraining paradigm; and (3) VER, the first large-scale bilingual video emotion reasoning dataset (40K+ samples). Built upon the Qwen architecture, Emotion-Qwen integrates cross-modal feature alignment with joint modeling of multi-source inputs (images, videos, text, and audio). Empirically, Emotion-Qwen achieves state-of-the-art performance across multiple emotion recognition benchmarks while retaining strong generalization on mainstream vision-language tasks. Code and models are publicly released.
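To make the dual-path routing concrete, here is a minimal PyTorch sketch of the idea as described in the summary: a learned gate softly mixes an emotion-specific expert and a general-purpose expert for each visual token. All names (`HybridCompressor`, `emotion_expert`, `router`), the dimensions, and the soft two-way gate are illustrative assumptions rather than the authors' released implementation, and the token-count reduction a "compressor" would normally also perform is omitted for brevity.

```python
import torch
import torch.nn as nn

class HybridCompressor(nn.Module):
    """Illustrative dual-path MoE block (an assumption; not the official code).

    A learned router produces per-token weights over two expert MLPs, so the
    emotion path can specialize while the general path preserves broad
    vision-language behavior.
    """

    def __init__(self, dim: int = 1024, hidden: int = 4096):
        super().__init__()

        def make_expert() -> nn.Sequential:
            return nn.Sequential(
                nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
            )

        self.emotion_expert = make_expert()  # emotion-specific path
        self.general_expert = make_expert()  # general-purpose path
        self.router = nn.Linear(dim, 2)      # per-token gate over the two paths

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, dim) visual features from the vision encoder
        gate = torch.softmax(self.router(tokens), dim=-1)  # (B, S, 2)
        return (gate[..., 0:1] * self.emotion_expert(tokens)
                + gate[..., 1:2] * self.general_expert(tokens))

if __name__ == "__main__":
    x = torch.randn(2, 16, 1024)        # dummy batch of visual tokens
    print(HybridCompressor()(x).shape)  # torch.Size([2, 16, 1024])
```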

📝 Abstract
Emotion understanding in videos aims to accurately recognize and interpret individuals' emotional states by integrating contextual, visual, textual, and auditory cues. While Large Multimodal Models (LMMs) have demonstrated significant progress in general vision-language (VL) tasks, their performance in emotion-specific scenarios remains limited. Moreover, fine-tuning LMMs on emotion-related tasks often leads to catastrophic forgetting, hindering their ability to generalize across diverse tasks. To address these challenges, we present Emotion-Qwen, a tailored multimodal framework designed to enhance both emotion understanding and general VL reasoning. Emotion-Qwen incorporates a sophisticated Hybrid Compressor based on the Mixture of Experts (MoE) paradigm, which dynamically routes inputs to balance emotion-specific and general-purpose processing. The model is pre-trained in a three-stage pipeline on large-scale general and emotional image datasets to support robust multimodal representations. Furthermore, we construct the Video Emotion Reasoning (VER) dataset, comprising more than 40K bilingual video clips with fine-grained descriptive annotations, to further enrich Emotion-Qwen's emotional reasoning capability. Experimental results demonstrate that Emotion-Qwen achieves state-of-the-art performance on multiple emotion recognition benchmarks, while maintaining competitive results on general VL tasks. Code and models are available at https://anonymous.4open.science/r/Emotion-Qwen-Anonymous.
Problem

Research questions and friction points this paper is trying to address.

Enhancing emotion understanding in videos using multimodal cues
Preventing catastrophic forgetting in emotion-tuned vision-language models
Balancing emotion-specific and general-purpose multimodal processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Compressor using Mixture of Experts
Three-stage pre-training on multimodal datasets (see the sketch after this list)
Video Emotion Reasoning dataset with 40K clips
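The page does not spell out what each of the three pre-training stages trains on or leaves unfrozen; the sketch below is only a plausible schematic of such a staged schedule. The stage names, dataset handles such as `general_image_text`, and the freeze lists are hypothetical, beyond the abstract's statement that pre-training uses general and emotional image data and that VER enriches emotional reasoning.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    """One phase of a staged LMM training schedule (schematic)."""
    name: str
    datasets: list[str]   # data mixture for this stage (handles are hypothetical)
    trainable: list[str]  # modules left unfrozen in this stage (assumed)

PIPELINE = [
    # 1) Align visual features with the LLM: train only the compressor.
    Stage("alignment", ["general_image_text"], ["hybrid_compressor"]),
    # 2) Joint pre-training on general plus emotional image data.
    Stage("joint_pretraining",
          ["general_image_text", "emotional_images"],
          ["hybrid_compressor", "llm"]),
    # 3) Emotion reasoning tuning on the bilingual VER video dataset.
    Stage("emotion_tuning", ["VER"], ["hybrid_compressor", "llm"]),
]

for stage in PIPELINE:
    print(f"{stage.name}: data={stage.datasets}, unfrozen={stage.trainable}")
```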
Authors
Dawei Huang
Shenzhen Technology University, Shenzhen, China
Qing Li
Shenzhen Technology University, Shenzhen, China
Chuan Yan
Stanford University, Stanford, USA
Zebang Cheng
Shenzhen University
AI, CV, MLLM, Affective Computing
Yurong Huang
University of Electronic Science and Technology of China, Chengdu, China
Xiang Li
Shenzhen Technology University, Shenzhen, China
Bin Li
Skyworth Digital, Shenzhen, China
Xiaohui Wang
Shenzhen Xiaopai Tech Co., Shenzhen, China
Zheng Lian
Associate Professor, IEEE/CCF Senior Member, Institute of Automation, Chinese Academy of Sciences
Affective Computing, Sentiment Analysis, Machine Learning
Xiaojiang Peng
Shenzhen Technology University
Computer Vision, Facial Expression Recognition, Multimodal Emotion Recognition