🤖 AI Summary
This work addresses automatic speech recognition (ASR) for low-resource languages, focusing on Campidanese Sardinian, and reveals that the final output layer of Wav2Vec2 models often introduces phonetic errors through overgeneralization, obscuring more accurate phoneme representations present in intermediate layers. By decoding each layer of the pretrained Wav2Vec2 encoder and combining phoneme alignment with fine-grained error categorization, the study introduces the concept of "regressive errors": cases in which deeper layers overwrite correct predictions from shallower ones. Experimental results show that the lowest phoneme error rate (PER) is achieved at the third-to-last layer, which significantly outperforms the final layer, and that intermediate layers exhibit less overgeneration and fewer phonological inaccuracies. These findings challenge the conventional "deeper-is-better" assumption and point to a practical optimization strategy for low-resource ASR systems.
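The layer-wise evaluation described above can be sketched with a plain PER computation: decode a phoneme sequence from each encoder layer, score each hypothesis against the reference with Levenshtein edit distance, and keep the layer with the lowest PER. The snippet below is a minimal stdlib-only illustration; the per-layer hypotheses, layer indices, and phoneme strings are invented for demonstration and do not come from the paper's model or corpus.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

def per(ref, hyp):
    """Phoneme Error Rate: edit distance normalised by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

# Hypothetical decodings of one utterance from three encoder layers.
reference = ["s", "a", "r", "d", "u"]
layer_hyps = {
    10: ["s", "a", "l", "d", "u"],       # one substitution
    11: ["s", "a", "r", "d", "u"],       # exact match
    12: ["s", "a", "r", "d", "u", "s"],  # final layer overgenerates
}
best_layer = min(layer_hyps, key=lambda k: per(reference, layer_hyps[k]))
print(best_layer, {k: round(per(reference, v), 2) for k, v in layer_hyps.items()})
```

In practice, the per-layer hypotheses would come from running CTC decoding on each hidden state of the encoder rather than from hand-written lists, but the selection criterion is the same.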
📝 Abstract
Recent studies have shown that intermediate layers in multilingual speech models often encode more phonetically accurate representations than the final output layer. In this work, we apply a layer-wise decoding strategy to a pretrained Wav2Vec2 model to investigate how phoneme-level predictions evolve across encoder layers, focusing on Campidanese Sardinian, a low-resource language. We show that truncating upper transformer layers leads to improved Phoneme Error Rates (PER), with the best performance achieved not at the final layer, but two layers earlier. Through fine-grained alignment analysis, we find that intermediate predictions better preserve segmental identity, avoid overgeneration, and reduce certain classes of phonological errors. We also introduce the notion of regressive errors, cases where correct predictions at intermediate layers are overwritten by errors at the final layer. These regressions highlight the limitations of surface-level error metrics and reveal how deeper layers may generalize or abstract away from acoustic detail. Our findings support the use of early-layer probing as a diagnostic tool for ASR models, particularly in low-resource settings where standard evaluation metrics may fail to capture linguistically meaningful behavior.
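The notion of regressive errors introduced above can be made concrete with a small sketch: given a reference phoneme sequence and the predictions of an intermediate and the final layer, a regressive error is any position where the intermediate layer is correct but the final layer is not. For simplicity the three sequences are assumed to be pre-aligned (equal length); the paper's actual alignment procedure is not reproduced, and the phoneme strings below are invented examples.

```python
def regressive_errors(reference, intermediate, final):
    """Indices where the intermediate layer matches the reference
    but the final layer overwrites it with an error."""
    return [i for i, (r, m, f) in enumerate(zip(reference, intermediate, final))
            if m == r and f != r]

ref  = ["p", "a", "n", "i"]
mid  = ["p", "a", "n", "i"]   # intermediate layer: all correct
last = ["p", "a", "ɲ", "i"]   # final layer overwrites /n/ with /ɲ/
print(regressive_errors(ref, mid, last))
```

Counting such positions separately from overall PER is what exposes the final layer's abstraction away from acoustic detail, since a surface-level metric alone would only report that both layers made some number of errors.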