AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition

📅 2026-04-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets two limitations of LLM-based multimodal emotion recognition. First, models rely on static parametric memory and struggle with fine-grained emotional understanding. Second, single-round retrieval-augmented generation is susceptible to modality ambiguity, so it misses complex affective dependencies across modalities. The authors propose AffectAgent, a multi-agent retrieval-augmented generation framework in which a query planner, an evidence filter, and an emotion generator collaborate to retrieve cross-modal samples, assess evidence, and generate predictions. The three agents are jointly optimized end-to-end with Multi-Agent Proximal Policy Optimization (MAPPO) under a shared affective reward. Two further components are introduced: Modality-Balancing Mixture of Experts (MB-MoE), which dynamically regulates the contribution of each modality to mitigate representation mismatch caused by cross-modal heterogeneity, and Retrieval-Augmented Adaptive Fusion (RAAF), which uses retrieved audiovisual embeddings for semantic completion when a modality is missing. Experiments on MER-UniBench are reported to show superior performance across complex scenarios.
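
The three-agent flow described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the function names (`plan_query`, `retrieve`, `filter_evidence`, `generate_emotion`, `shared_reward`), the word-overlap retriever, the score threshold, and the 0/1 reward are hypothetical stand-ins for learned LLM agents and a real cross-modal retriever. The only point it tries to show is that the planner, filter, and generator act in sequence and all receive the same reward signal.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Evidence:
    text: str
    label: str
    score: float  # hypothetical retrieval similarity in [0, 1]


def plan_query(utterance: str) -> str:
    # Hypothetical planner: a learned agent would rewrite the query;
    # here we only normalize the text.
    return utterance.lower().strip()


def retrieve(query: str, corpus: List[Tuple[str, str]], k: int = 3) -> List[Evidence]:
    # Toy retriever (assumption): word overlap instead of cross-modal embeddings.
    q = set(query.split())
    scored = [
        Evidence(text, label, len(q & set(text.lower().split())) / max(len(q), 1))
        for text, label in corpus
    ]
    return sorted(scored, key=lambda e: e.score, reverse=True)[:k]


def filter_evidence(evidence: List[Evidence], threshold: float = 0.2) -> List[Evidence]:
    # Hypothetical evidence filter: keep only samples above a fixed score.
    return [e for e in evidence if e.score >= threshold]


def generate_emotion(evidence: List[Evidence], default: str = "neutral") -> str:
    # Hypothetical generator: score-weighted vote over retrieved labels.
    if not evidence:
        return default
    votes: Dict[str, float] = {}
    for e in evidence:
        votes[e.label] = votes.get(e.label, 0.0) + e.score
    return max(votes, key=votes.get)


def shared_reward(prediction: str, gold: str) -> float:
    # Assumed reward shape: 1 for a correct label, 0 otherwise.
    return 1.0 if prediction == gold else 0.0


def run_episode(utterance: str, gold: str, corpus: List[Tuple[str, str]]):
    query = plan_query(utterance)
    kept = filter_evidence(retrieve(query, corpus))
    prediction = generate_emotion(kept)
    reward = shared_reward(prediction, gold)
    # Every agent is credited with the same scalar, as in a shared-reward setup.
    return prediction, {"planner": reward, "filter": reward, "generator": reward}
```

In an actual MAPPO setup, the per-agent rewards returned by `run_episode` would feed policy-gradient updates for each agent; that training loop is omitted here.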

📝 Abstract

LLM-based multimodal emotion recognition relies on static parametric memory and often hallucinates when interpreting nuanced affective states. In this paper, given that single-round retrieval-augmented generation is highly susceptible to modal ambiguity and therefore struggles to capture complex affective dependencies across modalities, we introduce AffectAgent, an affect-oriented multi-agent retrieval-augmented generation framework that leverages collaborative decision-making among agents for fine-grained affective understanding. Specifically, AffectAgent comprises three jointly optimized specialized agents, namely a query planner, an evidence filter, and an emotion generator, which collaboratively perform analytical reasoning to retrieve cross-modal samples, assess evidence, and generate predictions. These agents are optimized end-to-end using Multi-Agent Proximal Policy Optimization (MAPPO) with a shared affective reward to ensure consistent emotion understanding. Furthermore, we introduce Modality-Balancing Mixture of Experts (MB-MoE) and Retrieval-Augmented Adaptive Fusion (RAAF), where MB-MoE dynamically regulates the contributions of different modalities to mitigate representation mismatch caused by cross-modal heterogeneity, while RAAF enhances semantic completion under missing-modality conditions by incorporating retrieved audiovisual embeddings. Extensive experiments on MER-UniBench demonstrate that AffectAgent achieves superior performance across complex scenarios. Our code will be released at: https://github.com/Wz1h1NG/AffectAgent.
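
To make the roles of MB-MoE and RAAF more concrete, the following is a minimal sketch under stated assumptions, not the paper's architecture. It assumes each modality is a fixed-length embedding, that a missing modality is filled with the mean of retrieved embeddings for that modality, and that modality contributions are weighted by a softmax over gate scores. The names `fuse`, `mean_vector`, `gate_scores`, and `retrieved` are hypothetical, and in a trained model the gate scores would be produced by a learned network rather than passed in by hand.

```python
import math
from typing import Dict, List, Optional

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    # Element-wise mean of equally sized vectors.
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def softmax(scores: List[float]) -> List[float]:
    # Numerically stable softmax.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(
    modalities: Dict[str, Optional[Vector]],
    retrieved: Dict[str, List[Vector]],
    gate_scores: Dict[str, float],
) -> Vector:
    # Step 1 (RAAF-like, assumption): complete a missing modality
    # with the mean of retrieved embeddings for that modality.
    completed: Dict[str, Vector] = {}
    for name, vec in modalities.items():
        completed[name] = mean_vector(retrieved[name]) if vec is None else vec

    # Step 2 (MB-MoE-like, assumption): weight each modality by a
    # softmax over gate scores so no single modality dominates by default.
    names = sorted(completed)
    weights = softmax([gate_scores[n] for n in names])
    dim = len(completed[names[0]])
    return [
        sum(w * completed[n][i] for w, n in zip(weights, names))
        for i in range(dim)
    ]
```

For example, with a text embedding present, the audio embedding missing, and equal gate scores, `fuse` averages the text vector with the mean of the retrieved audio vectors.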

Problem

Research questions and friction points this paper is trying to address.

multimodal emotion recognition

retrieval-augmented generation

modal ambiguity

affective dependencies

cross-modal heterogeneity

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent reasoning

retrieval-augmented generation

modality-balancing mixture of experts