Prompt Amplification and Zero-Shot Late Fusion in Audio-Language Models for Speech Emotion Recognition

📅 2026-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the untapped complementarity between general-purpose audio-language models and domain-specific architectures in zero-shot speech emotion recognition. It proposes ZS-Fuse, an approach that integrates emotion predictions from a dual-encoder audio-language model with outputs from a specialized foundation model through a late-fusion strategy in a zero-shot setting. The method incorporates prompt ensembling and prompt amplification to improve robustness to prompt choice and to strengthen zero-shot performance. Experiments on three standard speech emotion recognition benchmarks show that ZS-Fuse consistently outperforms state-of-the-art baselines, including WavLM-Large, supporting both its effectiveness and its generalization.

📝 Abstract
Audio-Language Models (ALMs) are making strides in understanding speech and non-speech audio. However, domain-specialist Foundation Models (FMs) remain the best for closed-ended speech processing tasks such as Speech Emotion Recognition (SER). Using ALMs for zero-shot SER is a popular choice, but their potential to work with specialists to achieve state-of-the-art (SOTA) performance remains unexplored. We propose ZS-Fuse, a late-fusion method that combines zero-shot emotion estimates from a dual-encoder ALM with specialist FMs. To handle ambiguity in emotions and sensitivity to prompt choice, we (1) use a simple prompt ensemble and (2) propose a novel technique called prompt amplification, which repeats audio and text queries to unlock stronger zero-shot capabilities. We demonstrate the efficacy of our technique by evaluating ZS-Fuse with three dual-encoder ALMs and two FMs, and report improvements over SOTA baselines, such as WavLM-Large, on three speech emotion recognition datasets.
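The abstract's pipeline can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the emotion labels, prompt templates, `text_encoder` callable, CLIP-style temperature, and equal-weight fusion coefficient are all hypothetical, and "prompt amplification" is rendered here as a rough reading (repeating the query text before encoding).

```python
import numpy as np

EMOTIONS = ["angry", "happy", "sad", "neutral"]  # assumed label set

# Hypothetical prompt templates for the ensemble (not from the paper).
TEMPLATES = [
    "a recording of a person speaking in a {} tone",
    "this speech sounds {}",
    "the speaker's emotion is {}",
]

def softmax(x, t=1.0):
    z = np.asarray(x, dtype=float) / t
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def alm_zero_shot_probs(audio_emb, text_encoder, n_amplify=1):
    """Zero-shot emotion posterior from a dual-encoder ALM.

    Prompt ensembling: average text embeddings over templates.
    Prompt amplification (rough reading): repeat the query text
    n_amplify times before encoding.
    """
    a = audio_emb / np.linalg.norm(audio_emb)
    sims = []
    for emo in EMOTIONS:
        embs = []
        for tpl in TEMPLATES:
            prompt = " ".join([tpl.format(emo)] * n_amplify)
            embs.append(text_encoder(prompt))
        text_emb = np.mean(embs, axis=0)
        text_emb /= np.linalg.norm(text_emb)
        sims.append(float(a @ text_emb))  # cosine similarity
    return softmax(sims, t=0.07)  # CLIP-style temperature (assumed)

def zs_fuse(alm_probs, fm_probs, alpha=0.5):
    """Late fusion: convex combination of ALM and specialist-FM posteriors."""
    return alpha * np.asarray(alm_probs) + (1 - alpha) * np.asarray(fm_probs)
```

The key design choice is that fusion happens at the posterior level, so the ALM and the specialist FM never need shared embeddings or joint training: each model produces a distribution over the same label set, and `zs_fuse` mixes them.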
Problem

Research questions and friction points this paper is trying to address.

Speech Emotion Recognition
Audio-Language Models
Zero-Shot Learning
Prompt Sensitivity
Emotion Ambiguity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prompt Amplification
Zero-Shot Late Fusion
Audio-Language Models
Speech Emotion Recognition
Foundation Models