VISAFF: Speaker-Centered Visual Affective Feature Learning for Emotion Recognition in Conversation

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of existing conversational emotion recognition methods, which often overlook critical visual emotional cues or suffer from interference by irrelevant background content and non-speaking individuals when leveraging vision-language models. Additionally, fine-tuning large models incurs substantial computational costs. To overcome these challenges, the paper introduces VISAFF, a novel speaker-centric visual emotion learning framework that operates without fine-tuning. VISAFF first isolates the speaker region to extract emotion-relevant visual features and then dynamically fuses textual and acoustic modalities through a multimodal reliability assessment mechanism to compensate for visual uncertainty. With all large model parameters frozen, the proposed approach achieves performance on par with state-of-the-art methods on two real-world conversational datasets while significantly improving computational efficiency.

📝 Abstract

Emotion Recognition in Conversation (ERC), which aims to identify speakers' emotional states in multi-turn dialogues, is essential for effective human-machine interaction. Early text-based methods struggle with complex scenarios like sarcasm because they inherently neglect vital non-verbal information. While recent Vision-Language Models (VLMs) address this by analyzing video directly, they are not inherently tailored for ERC and often focus on emotionally irrelevant background regions or passive listeners rather than the active speaker. Furthermore, fine-tuning these large models incurs prohibitive computational costs. In addition, visual signals in isolation are frequently ambiguous or technically compromised without the context of linguistic content and vocal prosody. To address these challenges, we propose VISAFF, a speaker-centered VISual AFFective feature learning framework for ERC. VISAFF consists of two stages: Speaker-Centered Affective Grounding and Reliability-Guided Affective Complementation. In the first stage, VISAFF uses a tuning-free approach to unlock the reasoning capabilities of frozen VLMs, efficiently steering them to focus on the active speaker's emotional visual cues without heavy training overhead. In the second stage, we introduce a reliability-guided affective complementation mechanism that dynamically leverages textual and acoustic modalities to compensate for visual uncertainty. Experiments on two real-world datasets demonstrate that VISAFF achieves highly competitive performance compared to state-of-the-art methods in a tuning-free setting, significantly enhancing computational efficiency by eliminating the need for expensive fine-tuning of large VLMs. The source code is available at https://anonymous.4open.science/r/speaker-2365/.
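
The abstract does not specify how the reliability-guided complementation is computed. As a rough illustration only, the sketch below shows one generic way to let textual and acoustic features compensate for an unreliable visual feature: a softmax over per-modality reliability scores, followed by a weighted sum. The function name `fuse_by_reliability`, the scalar reliability scores, and the softmax weighting are all assumptions made for this example and are not taken from the VISAFF source code.

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_by_reliability(
    features: Dict[str, List[float]],
    reliability: Dict[str, float],
) -> List[float]:
    """Hypothetical reliability-weighted fusion (not the paper's method).

    features:    modality name -> feature vector (all the same length)
    reliability: modality name -> scalar score; higher means more trusted
    """
    names = sorted(features)
    # Turn raw reliability scores into weights that sum to 1.
    weights = softmax([reliability[name] for name in names])
    dim = len(features[names[0]])
    fused = [0.0] * dim
    for weight, name in zip(weights, names):
        for i, value in enumerate(features[name]):
            fused[i] += weight * value
    return fused


# Toy features: when the visual score is very low, the fused vector is
# dominated by the text and audio features.
toy = {"visual": [1.0, 0.0], "text": [0.0, 1.0], "audio": [0.0, 1.0]}
fused = fuse_by_reliability(toy, {"visual": -50.0, "text": 0.0, "audio": 0.0})
```

In this toy setup, a low visual reliability score drives the visual weight toward zero, so the fused representation falls back on text and audio. An actual system would presumably estimate reliability from the inputs rather than receive it as a fixed number.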

Problem

Research questions and friction points this paper is trying to address.

Emotion Recognition in Conversation

Visual Affective Features

Speaker-Centered Modeling

Vision-Language Models

Multimodal Emotion Recognition

Innovation

Methods, ideas, or system contributions that make the work stand out.

Speaker-Centered

Tuning-Free

Vision-Language Models