🤖 AI Summary
Prior work lacks systematic evaluation of general-purpose multimodal large language models (Omni-LLMs) on zero-shot speech emotion recognition (SER), particularly across standard benchmarks with diverse input modalities.
Method: This study conducts the first comprehensive zero-shot assessment of four representative Omni-LLMs on IEMOCAP and MELD, covering audio-only, text-only, and joint audio-text inputs. We propose "acoustic prompting", a prompting paradigm that integrates low-level acoustic feature analysis, dialogue context awareness, and stepwise reasoning, complemented by an analysis of context window size and an error attribution analysis of the generated reasoning.
Contribution/Results: Zero-shot Omni-LLMs match or surpass fine-tuned audio-specific models on both benchmarks, demonstrating strong cross-modal emotional understanding. Comparing acoustic prompting against minimal and full chain-of-thought prompting further shows how prompt design shapes multimodal reasoning. These findings establish acoustic prompting as an effective framework for zero-shot SER and advance understanding of multimodal reasoning in Omni-LLMs.
📝 Abstract
The use of omni-LLMs (large language models that accept any modality as input) is understudied, particularly for multimodal cognitive state tasks involving speech. We present OmniVox, the first systematic evaluation of four omni-LLMs on zero-shot emotion recognition. We evaluate on two widely used multimodal emotion benchmarks, IEMOCAP and MELD, and find that zero-shot omni-LLMs outperform or are competitive with fine-tuned audio models. Alongside our audio-only evaluation, we also evaluate omni-LLMs on text-only and combined text-and-audio inputs. We present acoustic prompting, an audio-specific prompting strategy for omni-LLMs that focuses on acoustic feature analysis, conversation context analysis, and step-by-step reasoning. We compare acoustic prompting to minimal prompting and full chain-of-thought prompting. We perform a context window analysis on IEMOCAP and MELD and find that using context helps, especially on IEMOCAP. We conclude with an error analysis of the acoustic reasoning outputs generated by the omni-LLMs.
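To make the idea concrete, the three components of acoustic prompting described above (acoustic feature analysis, conversation context analysis, and step-by-step reasoning) can be sketched as a prompt-building helper. This is an illustrative sketch, not the paper's actual prompt: the wording, the `build_acoustic_prompt` function, and the four-class label set (a common IEMOCAP configuration) are all assumptions.

```python
# Hypothetical sketch of an "acoustic prompting" template for an omni-LLM.
# The wording and label set are illustrative assumptions, not the paper's prompt.

EMOTION_LABELS = ["angry", "happy", "sad", "neutral"]  # common IEMOCAP 4-way set

def build_acoustic_prompt(context_utterances, num_context=2):
    """Assemble an acoustic-prompting instruction for an omni-LLM.

    context_utterances: preceding transcript turns, oldest first; only the
    last `num_context` turns are kept, mirroring a context window analysis.
    """
    context = "\n".join(context_utterances[-num_context:]) or "(no prior context)"
    return (
        "You will hear an audio clip of a speaker in a conversation.\n"
        "Step 1 - Acoustic features: describe the pitch, energy, speaking "
        "rate, and voice quality you hear.\n"
        "Step 2 - Conversation context: consider the preceding turns:\n"
        f"{context}\n"
        "Step 3 - Reason step by step about how the acoustic evidence and "
        "context point to one emotion.\n"
        f"Answer with exactly one label from: {', '.join(EMOTION_LABELS)}."
    )

# Example: a two-turn context window preceding the target utterance
prompt = build_acoustic_prompt(
    ["A: Why didn't you call me back?", "B: I was busy, okay?"],
    num_context=2,
)
```

The prompt text would accompany the raw audio in the model's multimodal input; shrinking or growing `num_context` is one way to run the kind of context window analysis the abstract describes.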