🤖 AI Summary
Prior work lacks systematic evaluation of general-purpose multimodal large language models (Omni-LLMs) on zero-shot speech emotion recognition (SER), particularly across standard benchmarks with diverse input modalities.
Method: This study conducts the first comprehensive zero-shot assessment of four representative Omni-LLMs on IEMOCAP and MELD, covering audio-only, text-only, and joint audio-text inputs. We propose "acoustic prompting", a prompting paradigm that integrates low-level acoustic feature analysis, dialogue context awareness, and stepwise reasoning, complemented by an analysis of context window size and an error attribution analysis of the generated reasoning.
Contribution/Results: Zero-shot Omni-LLMs match or surpass fine-tuned audio-specific models on both benchmarks, demonstrating strong cross-modal emotional understanding. Comparing acoustic prompting against minimal and full chain-of-thought prompting further shows how prompt design shapes multimodal reasoning. These findings establish acoustic prompting as an effective framework for zero-shot SER and advance understanding of multimodal reasoning in Omni-LLMs.
📝 Abstract
The use of omni-LLMs (large language models that accept any modality as input) is understudied, particularly for multimodal cognitive state tasks involving speech. We present OmniVox, the first systematic evaluation of four omni-LLMs on zero-shot emotion recognition. We evaluate on two widely used multimodal emotion benchmarks, IEMOCAP and MELD, and find that zero-shot omni-LLMs outperform or are competitive with fine-tuned audio models. Alongside our audio-only evaluation, we also evaluate omni-LLMs on text-only and combined text-and-audio inputs. We present acoustic prompting, an audio-specific prompting strategy for omni-LLMs that focuses on acoustic feature analysis, conversation context analysis, and step-by-step reasoning. We compare acoustic prompting to minimal prompting and full chain-of-thought prompting. We perform a context window analysis on IEMOCAP and MELD and find that using context helps, especially on IEMOCAP. We conclude with an error analysis of the acoustic reasoning outputs generated by the omni-LLMs.
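To make the idea concrete, the three components of acoustic prompting described above (acoustic feature analysis, conversation context analysis, and step-by-step reasoning) can be sketched as a prompt-building helper. This is an illustrative sketch, not the paper's actual prompt: the wording, the `build_acoustic_prompt` function, and the four-class label set (a common IEMOCAP configuration) are all assumptions.

```python
# Hypothetical sketch of an "acoustic prompting" template for an omni-LLM.
# The wording and label set are illustrative assumptions, not the paper's prompt.

EMOTION_LABELS = ["angry", "happy", "sad", "neutral"]  # common IEMOCAP 4-way set

def build_acoustic_prompt(context_utterances, num_context=2):
    """Assemble an acoustic-prompting instruction for an omni-LLM.

    context_utterances: preceding transcript turns, oldest first; only the
    last `num_context` turns are kept, mirroring a context window analysis.
    """
    context = "\n".join(context_utterances[-num_context:]) or "(no prior context)"
    return (
        "You will hear an audio clip of a speaker in a conversation.\n"
        "Step 1 - Acoustic features: describe the pitch, energy, speaking "
        "rate, and voice quality you hear.\n"
        "Step 2 - Conversation context: consider the preceding turns:\n"
        f"{context}\n"
        "Step 3 - Reason step by step about how the acoustic evidence and "
        "context point to one emotion.\n"
        f"Answer with exactly one label from: {', '.join(EMOTION_LABELS)}."
    )

# Example: a two-turn context window preceding the target utterance
prompt = build_acoustic_prompt(
    ["A: Why didn't you call me back?", "B: I was busy, okay?"],
    num_context=2,
)
```

The prompt text would accompany the raw audio in the model's multimodal input; shrinking or growing `num_context` is one way to run the kind of context window analysis the abstract describes.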