AV-EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Omni-modal LLMs with Audio-visual Cues

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) lack standardized, systematic evaluation protocols for affective reasoning—particularly regarding audiovisual cue integration and dialogue-level emotional coherence. Method: We introduce AV-EMO-Reasoning, the first benchmark tailored for affective reasoning in multimodal LLMs. It features a multi-dimensional evaluation framework spanning continuous, categorical, and perceptual metrics; employs single- and multi-turn dialogues drawn from both synthetic and real-world scenarios; supports comparative assessment of unimodal versus multimodal inputs; and combines automated metrics with human perceptual validation. Results: Experiments reveal that visual cues substantially improve emotional coherence over audio-only baselines, multimodal inputs enhance affective expressiveness in speech generation, and the proposed metric families exhibit strong complementarity. AV-EMO-Reasoning establishes a reproducible, scalable paradigm for evaluating multimodal affective intelligence, enabling rigorous cross-model comparison and targeted advancement in affective reasoning capabilities.
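The summary describes an evaluation framework that combines categorical, continuous, and perceptual metric families. A minimal sketch of how such automatic metrics might be combined is shown below; the paper does not publish its exact formulas, so all names, the valence representation, and the weighting scheme here are illustrative assumptions, not the benchmark's actual implementation.

```python
# Hypothetical sketch of combining two automatic metric families
# (categorical label accuracy and continuous valence coherence).
# All names and formulas are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Affect:
    emotion: str      # categorical label, e.g. "happy", "sad", "angry"
    valence: float    # continuous affect estimate in [-1.0, 1.0]

def categorical_accuracy(preds, refs):
    """Fraction of turns whose predicted emotion label matches the reference."""
    hits = sum(p.emotion == r.emotion for p, r in zip(preds, refs))
    return hits / len(refs)

def continuous_coherence(preds, refs):
    """1 minus normalized mean absolute valence error, clipped to [0, 1]."""
    err = sum(abs(p.valence - r.valence) for p, r in zip(preds, refs)) / len(refs)
    return max(0.0, 1.0 - err / 2.0)  # valence spans [-1, 1], so max error is 2

def aggregate(preds, refs, w_cat=0.5, w_cont=0.5):
    """Weighted combination of the two automatic metric families."""
    return (w_cat * categorical_accuracy(preds, refs)
            + w_cont * continuous_coherence(preds, refs))

preds = [Affect("happy", 0.8), Affect("sad", -0.6)]
refs  = [Affect("happy", 0.9), Affect("angry", -0.7)]
print(aggregate(preds, refs))  # 0.5 * 0.5 + 0.5 * 0.95 = 0.725
```

Perceptual validation (human listening tests) would sit alongside such scores rather than inside them, which is consistent with the paper's finding that automatic and perceptual metrics capture distinct facets.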

📝 Abstract
Emotions conveyed through voice and face shape engagement and context in human-AI interaction. Despite rapid progress in omni-modal large language models (LLMs), the holistic evaluation of emotional reasoning with audiovisual cues remains limited. To address this gap, we introduce AV-EMO-Reasoning, a benchmark designed to systematically assess emotional coherence in LLMs. The framework leverages a curated, single- and multi-turn synthetic audiovisual corpus with a real-world set and is assessed under continuous, categorical, and perceptual metrics. Experiments with leading LLMs show that visual cues reliably improve emotional coherence over audio-only baselines. Moreover, LLMs can leverage audio-visual cues to generate more emotion-aware speech. Models exhibit complementary strengths across metric families, indicating that automatic scores capture facets distinct from perceptual judgments. By releasing a systematic evaluation benchmark, AV-EMO-Reasoning offers a reproducible standard for evaluating emotion-aware dialogue and advances toward more natural, adaptive human-AI interaction.
Problem

Research questions and friction points this paper is trying to address.

Benchmarking emotional reasoning in omni-modal LLMs
Assessing emotional coherence using audiovisual cues
Evaluating emotion-aware dialogue for human-AI interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark evaluates emotional reasoning with audiovisual cues
Framework uses synthetic and real-world audiovisual datasets
Models generate emotion-aware speech using multimodal inputs