Lost in Phonation: Voice Quality Variation as an Evaluation Dimension for Speech Foundation Models

📅 2025-10-29
🤖 AI Summary
Existing speech foundation models lack systematic evaluation of voice quality variations such as hoarseness, breathiness, and creakiness, which constitute critical paralinguistic cues influencing emotional and social inference. Method: The study conducts the first systematic investigation into model sensitivity to vocal timbre, introducing a parallel speech dataset with synthetically modified voice qualities to overcome the limitations of conventional multiple-choice benchmarks; it employs open-ended generation, fine-grained emotion recognition, and contrastive analysis to assess response consistency across vocal conditions. Contribution/Results: The study demonstrates significant behavioural inconsistency in mainstream speech foundation models under varying voice qualities, revealing substantial performance degradation and inconsistent semantic or affective interpretations. These findings empirically validate voice quality as a critical, previously overlooked dimension for evaluating speech foundation models, underscoring both its methodological necessity and its practical significance for robust, socially aware speech understanding.

📝 Abstract
Recent advances in speech foundation models (SFMs) have enabled the direct processing of spoken language from raw audio, bypassing intermediate textual representations. This capability allows SFMs to be exposed to, and potentially respond to, rich paralinguistic variations embedded in the input speech signal. One under-explored dimension of paralinguistic variation is voice quality, encompassing phonation types such as creaky and breathy voice. These phonation types are known to influence how listeners infer affective state, stance and social meaning in speech. Existing benchmarks for speech understanding largely rely on multiple-choice question answering (MCQA) formats, which are prone to failure and therefore unreliable in capturing the nuanced ways paralinguistic features influence model behaviour. In this paper, we probe SFMs through open-ended generation tasks and speech emotion recognition, evaluating whether model behaviours are consistent across different phonation inputs. We introduce a new parallel dataset featuring synthesized modifications to voice quality, designed to evaluate SFM responses to creaky and breathy voice. Our work provides the first examination of SFM sensitivity to these particular non-lexical aspects of speech perception.
Problem

Research questions and friction points this paper is trying to address.

Evaluating speech foundation models' sensitivity to voice quality variations
Assessing model consistency across creaky and breathy phonation types
Investigating paralinguistic feature impacts on speech model behaviors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated SFMs using open-ended generation tasks
Assessed speech emotion recognition across phonation types
Created parallel dataset with synthesized voice quality modifications
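The consistency evaluation implied by these bullets can be sketched as follows. This is a hypothetical illustration, not the paper's actual pipeline: `predict_emotion` is a stand-in stub (a real run would invoke an SFM on the parallel audio), and the utterance IDs and labels are invented for demonstration. The metric shown is the fraction of utterances whose predicted emotion label is identical across all phonation conditions.

```python
# Hypothetical sketch of cross-phonation consistency scoring.
# predict_emotion is a stub standing in for a real SFM call.
def predict_emotion(audio_id: str, phonation: str) -> str:
    # Invented example predictions; a real implementation would
    # run the model on the modal/breathy/creaky audio versions.
    stub = {
        ("utt1", "modal"): "neutral",
        ("utt1", "breathy"): "sad",      # breathy version flips the label
        ("utt1", "creaky"): "neutral",
        ("utt2", "modal"): "happy",
        ("utt2", "breathy"): "happy",
        ("utt2", "creaky"): "happy",
    }
    return stub[(audio_id, phonation)]

def consistency_rate(utterances, phonations=("modal", "breathy", "creaky")):
    """Fraction of utterances whose predicted label is identical
    across every phonation condition."""
    consistent = 0
    for utt in utterances:
        labels = {predict_emotion(utt, p) for p in phonations}
        if len(labels) == 1:
            consistent += 1
    return consistent / len(utterances)

print(consistency_rate(["utt1", "utt2"]))  # utt2 is consistent, utt1 is not: 0.5
```

A lower consistency rate on the parallel dataset would indicate that the model's affective interpretation depends on voice quality rather than lexical content alone, which is the behavioural inconsistency the paper reports.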
Harm Lameris
Department of Speech, Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden
Shree Harsha Bokkahalli Satish
Department of Speech, Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden
Joakim Gustafson
Department of Speech, Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden
Éva Székely
Assistant Professor, KTH Royal Institute of Technology
speech technology, speech synthesis, deep learning, generative modelling, bias detection