Echoes of Phonetics: Unveiling Relevant Acoustic Cues for ASR via Feature Attribution

📅 2025-06-02
🤖 AI Summary
This work addresses the lack of clarity about which acoustic cues a modern Conformer-based automatic speech recognition (ASR) model relies on. It applies a feature attribution technique to plosives, fricatives, and vowels, and checks whether the resulting attributions align with the phonetic properties of these sounds in the time and frequency domains, such as vowel formants (F1/F2), fricative spectral shape, and plosive release bursts. The key findings are: (1) the model relies on the full time span of vowels, particularly their first two formants; (2) it captures the spectral characteristics of sibilant fricatives better than those of non-sibilants; (3) it prioritizes the release phase of plosives, especially burst characteristics; and (4) vowel saliency is greater in male speech. These results enhance the interpretability of ASR models and point to areas where future research may uncover gaps in model robustness.

📝 Abstract
Despite significant advances in ASR, the specific acoustic cues models rely on remain unclear. Prior studies have examined such cues on a limited set of phonemes and outdated models. In this work, we apply a feature attribution technique to identify the relevant acoustic cues for a modern Conformer-based ASR system. By analyzing plosives, fricatives, and vowels, we assess how feature attributions align with their acoustic properties in the time and frequency domains, also essential for human speech perception. Our findings show that the ASR model relies on vowels' full time spans, particularly their first two formants, with greater saliency in male speech. It also better captures the spectral characteristics of sibilant fricatives than non-sibilants and prioritizes the release phase in plosives, especially burst characteristics. These insights enhance the interpretability of ASR models and highlight areas for future research to uncover potential gaps in model robustness.
Problem

Research questions and friction points this paper is trying to address.

Identify the acoustic cues a modern Conformer-based ASR system relies on
Analyze feature attributions for plosives, fricatives, and vowels
Enhance ASR interpretability and uncover model robustness gaps
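
To make the idea of feature attribution on speech input concrete, here is a minimal occlusion-style sketch. It is not the paper's method, which this page does not specify beyond "a feature attribution technique". The names `occlusion_saliency` and `toy_score` are hypothetical, and `toy_score` is only a stand-in for the probability an ASR model would assign to a target token.

```python
from typing import Callable, List

# spectrogram[t][f]: energy at time frame t and frequency bin f
Spectrogram = List[List[float]]


def occlusion_saliency(
    spec: Spectrogram,
    score_fn: Callable[[Spectrogram], float],
    patch_t: int = 2,
    patch_f: int = 2,
) -> Spectrogram:
    """Return, for every cell, the score drop caused by zeroing its patch."""
    n_t, n_f = len(spec), len(spec[0])
    base = score_fn(spec)
    saliency = [[0.0] * n_f for _ in range(n_t)]
    for t0 in range(0, n_t, patch_t):
        for f0 in range(0, n_f, patch_f):
            t_range = range(t0, min(t0 + patch_t, n_t))
            f_range = range(f0, min(f0 + patch_f, n_f))
            # Copy the input and silence one time-frequency patch.
            masked = [row[:] for row in spec]
            for t in t_range:
                for f in f_range:
                    masked[t][f] = 0.0
            drop = base - score_fn(masked)
            # A large drop means the model depended on this patch.
            for t in t_range:
                for f in f_range:
                    saliency[t][f] = drop
    return saliency


def toy_score(spec: Spectrogram) -> float:
    """Hypothetical scorer (assumption): only frequency bins 2 and 3 matter."""
    return sum(row[2] + row[3] for row in spec)


spec = [[1.0] * 6 for _ in range(4)]  # 4 frames x 6 frequency bins
sal = occlusion_saliency(spec, toy_score)
```

In this toy setup only the patches covering bins 2 and 3 receive non-zero saliency, which is the kind of time-frequency map that can then be compared against phonetic expectations.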
Innovation

Methods, ideas, or system contributions that make the work stand out.

Applies feature attribution to identify relevant acoustic cues
Analyzes plosives, fricatives, and vowels in the time and frequency domains
Focuses on the interpretability of a modern Conformer-based ASR model
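
One way to compare a saliency map with acoustic properties in the time and frequency domains is to collapse it along each axis within a phone segment. The sketch below assumes a non-negative saliency map indexed as `sal[t][f]` and a fixed bin width in Hz. The function names, the toy map, and the 1000-3000 Hz band are illustrative assumptions, not values from the paper.

```python
from typing import List

SaliencyMap = List[List[float]]  # sal[t][f], assumed non-negative


def time_profile(sal: SaliencyMap, start: int, end: int) -> List[float]:
    """Mean saliency per frame for frames in [start, end)."""
    return [sum(row) / len(row) for row in sal[start:end]]


def frequency_profile(sal: SaliencyMap, start: int, end: int) -> List[float]:
    """Mean saliency per frequency bin over frames in [start, end)."""
    frames = sal[start:end]
    n_f = len(frames[0])
    return [sum(fr[f] for fr in frames) / len(frames) for f in range(n_f)]


def band_share(freq_prof: List[float], bin_hz: float,
               lo_hz: float, hi_hz: float) -> float:
    """Fraction of saliency in bins whose lower edge lies in [lo_hz, hi_hz)."""
    total = sum(freq_prof)
    if total == 0:
        return 0.0
    inside = sum(v for i, v in enumerate(freq_prof)
                 if lo_hz <= i * bin_hz < hi_hz)
    return inside / total


# Toy saliency map for one phone segment: 3 frames x 4 bins of 1000 Hz each.
segment_sal = [
    [0.0, 2.0, 2.0, 0.0],
    [0.0, 2.0, 2.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
]
t_prof = time_profile(segment_sal, 0, 3)
f_prof = frequency_profile(segment_sal, 0, 3)
share = band_share(f_prof, bin_hz=1000.0, lo_hz=1000.0, hi_hz=3000.0)
```

A flat time profile would suggest the whole segment is used, while a high band share would suggest saliency concentrates in the chosen frequency range.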