🤖 AI Summary
This work addresses the unclear mechanisms underlying representational geometric transformations in self-supervised speech models under both natural and adversarial perturbations, as well as the lack of effective tools for monitoring local structural changes. The authors propose GRIDS, a novel framework that introduces Local Intrinsic Dimensionality (LID) to analyze self-supervised speech representations for the first time. By computing LID layer-wise for WavLM and wav2vec 2.0, they characterize local geometric distortions and correlate them with automatic speech recognition performance degradation. The study reveals that low signal-to-noise ratio (SNR) perturbations generally increase LID, while at high SNR, adversarial examples sustain elevated LID in early layers unlike benign noise. Critically, LID dynamics closely align with word error rate trends, and an LID-based, text-free anomaly detection method achieves AUROC scores between 0.78 and 1.00.
📝 Abstract
Self-supervised speech models (S3Ms) achieve strong downstream performance, yet their learned representations remain poorly understood under natural and adversarial perturbations. Prior studies rely on representation similarity or global dimensionality, offering limited visibility into local geometric changes. We ask: how do perturbations deform local geometry, and do these shifts track downstream automatic speech recognition (ASR) degradation? To address this, we present GRIDS, a framework using Local Intrinsic Dimensionality (LID) across layer-wise representations in WavLM and wav2vec 2.0. We find that LID increases for all low signal-to noise ratio (SNR) perturbations and diverges at high SNR: benign noise converges toward the clean profile, while adversarial inputs retain early-layer LID elevation. We show LID elevation co-occurs with increased WER, and that layer-wise LID features enable anomaly detection (AUROC 0.78-1.00), opening the door to transcript-free monitoring in S3Ms.