Evaluating the Representation of Vowels in Wav2Vec Feature Extractor: A Layer-Wise Analysis Using MFCCs

📅 2025-08-25
🤖 AI Summary
This study investigates how well the CNN front-end of Wav2Vec 2.0 represents the acoustic properties of monophthongs, focusing on front–back vowel discrimination. Using the TIMIT corpus, we extract activations from each CNN layer and compare them against hand-crafted features (MFCCs, and MFCCs augmented with formant frequencies), with SVM classification accuracy as an interpretable, quantitative metric. The layer-wise analysis indicates that low-level CNN activations classify front versus back vowels on par with or better than the traditional features, which suggests that they implicitly encode critical acoustic cues (e.g., F1/F2 distributions). The layer-resolved framework offers an interpretable, quantifiable view of the internal representations of self-supervised speech models, and the findings point to the usefulness of Wav2Vec 2.0's early-layer features for segmental speech analysis.
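A layer-wise comparison needs to know which output frames of each CNN layer fall inside a given vowel. The sketch below works that out from the layer geometry alone. The kernel and stride values are an assumption taken from the commonly published wav2vec 2.0 base configuration (seven unpadded 1-D convolutions on 16 kHz audio). The summary above does not state them, and the function names are illustrative.

```python
# Assumed wav2vec 2.0 base feature-extractor geometry (not stated in the summary):
# seven 1-D conv layers, no padding, 16 kHz input.
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)


def layer_geometry(kernels=KERNELS, strides=STRIDES):
    """Return (receptive_field, hop) in input samples for each layer's frames."""
    geometry = []
    rf, hop = 1, 1
    for k, s in zip(kernels, strides):
        rf += (k - 1) * hop  # each extra kernel tap widens the field by the current hop
        hop *= s             # strides compound the spacing between frames
        geometry.append((rf, hop))
    return geometry


def frames_in_segment(start, end, n_samples, layer, kernels=KERNELS, strides=STRIDES):
    """Indices of frames at `layer` (1-based) whose receptive field lies fully
    inside the sample span [start, end) of an utterance with n_samples samples."""
    rf, hop = layer_geometry(kernels, strides)[layer - 1]
    n_frames = n_samples
    for k, s in zip(kernels[:layer], strides[:layer]):
        n_frames = (n_frames - k) // s + 1  # output length of an unpadded conv
    return [t for t in range(n_frames) if t * hop >= start and t * hop + rf <= end]
```

Under these assumptions the last layer has a 400-sample receptive field and a 320-sample hop (25 ms and 20 ms at 16 kHz), and a hypothetical 100 ms vowel spanning samples 4000 to 5600 maps to frames 13 through 16 at layer 7. Requiring the whole receptive field to sit inside the vowel is one possible design choice; a looser rule based on frame centers would keep more frames at the cost of some context from neighboring phones.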

📝 Abstract
Automatic Speech Recognition has advanced with self-supervised learning, enabling feature extraction directly from raw audio. In Wav2Vec, a CNN first transforms audio into feature vectors before the transformer processes them. This study examines CNN-extracted information for monophthong vowels using the TIMIT corpus. We compare MFCCs, MFCCs with formants, and CNN activations by training SVM classifiers for front-back vowel identification, assessing their classification accuracy to evaluate phonetic representation.
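The comparison described in the abstract can be pictured as one classifier recipe applied to several feature sets. The following is a minimal sketch using scikit-learn, not the authors' setup: the RBF kernel, the standardization step, the 5-fold cross-validation, and the feature dimensions (13 MFCCs, 2 formants, 512 CNN channels) are all assumptions, and the arrays are random placeholders standing in for real TIMIT features.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def svm_accuracy(features, labels, folds=5):
    """Mean cross-validated accuracy of an RBF-kernel SVM on one feature set."""
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    scores = cross_val_score(model, features, labels, cv=folds, scoring="accuracy")
    return float(scores.mean())


# Synthetic placeholders only: random vectors with assumed shapes, not real data.
rng = np.random.default_rng(0)
n_tokens = 200
labels = rng.integers(0, 2, size=n_tokens)  # 0 = front vowel, 1 = back vowel
feature_sets = {
    "mfcc": rng.normal(size=(n_tokens, 13)),
    "mfcc+formants": rng.normal(size=(n_tokens, 15)),
    "cnn_layer_1": rng.normal(size=(n_tokens, 512)),
}

# One accuracy per feature set, all scored by the same classifier recipe.
accuracies = {name: svm_accuracy(x, labels) for name, x in feature_sets.items()}
```

Because the placeholder features are random, the resulting accuracies carry no meaning. The point is only that holding the classifier fixed makes accuracy differences attributable to the features.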
Problem

Research questions and friction points this paper is trying to address.

Evaluating vowel representation in Wav2Vec's CNN feature extractor
Comparing MFCCs and CNN activations for vowel classification
Assessing phonetic information in self-supervised speech models
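Front-back classification first requires labeled vowel tokens. TIMIT ships phone-level transcriptions as .PHN files with one "start end label" triple per line, in samples. The sketch below turns such a file into front/back segments. The grouping of labels into front and back, and the decision to leave out central vowels and diphthongs, are assumptions for illustration rather than the paper's stated vowel inventory.

```python
# Assumed grouping of TIMIT monophthong labels; central vowels are left out.
FRONT = {"iy", "ih", "eh", "ae"}
BACK = {"uw", "uh", "ao", "aa"}


def vowel_segments(phn_text):
    """Parse TIMIT .PHN content and return (start, end, "front" | "back")
    tuples, in samples, for the vowels of interest."""
    segments = []
    for line in phn_text.strip().splitlines():
        start, end, label = line.split()
        if label in FRONT:
            segments.append((int(start), int(end), "front"))
        elif label in BACK:
            segments.append((int(start), int(end), "back"))
    return segments


# Hypothetical transcription fragment, made up for this example.
example_phn = """0 2000 h#
2000 3500 sh
3500 5200 iy
5200 6000 w
6000 7800 aa
"""
tokens = vowel_segments(example_phn)
```

Each returned span can then be used to slice MFCC frames or CNN activations for that vowel.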
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layer-wise CNN activation analysis
MFCC and formant feature integration
SVM classifier accuracy evaluation
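For readers unfamiliar with the hand-crafted baseline, here is a minimal NumPy-only sketch of MFCC extraction and of appending formant values to a token-level vector. The frame length, hop, filter count, and number of coefficients are common defaults assumed for illustration, not the paper's settings. Pre-emphasis and DCT normalization are omitted, and the formant values are hypothetical inputs that would normally come from a separate formant tracker.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters evenly spaced on the mel scale, shape (n_filters, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for b in range(left, center):       # rising edge of the triangle
            bank[i, b] = (b - left) / (center - left)
        for b in range(center, right):      # falling edge of the triangle
            bank[i, b] = (right - b) / (right - center)
    return bank


def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_filters=26, n_ceps=13):
    """Return MFCCs of shape (n_frames, n_ceps) for a 1-D signal."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    log_mel = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    # Unnormalized DCT-II over the filter axis, keeping the first n_ceps rows.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.arange(n_ceps)[:, None] * (2 * n + 1) / (2 * n_filters))
    return log_mel @ basis.T


def token_vector(mfcc_frames, formants_hz):
    """Average MFCC frames over a vowel and append formant values."""
    return np.concatenate([mfcc_frames.mean(axis=0), np.asarray(formants_hz, dtype=float)])


t = np.arange(16000) / 16000.0
frames = mfcc(np.sin(2 * np.pi * 440.0 * t))                # one second of a 440 Hz tone
vector = token_vector(frames, formants_hz=(300.0, 2300.0))  # hypothetical F1, F2
```

Averaging over frames is one simple way to get a fixed-length vector per vowel. Formants in Hz sit on a very different scale from cepstral coefficients, so standardizing the combined vector before classification is usually advisable.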