Analysis of Speaker Verification Performance Trade-offs with Neural Audio Codec Transmission

📅 2025-09-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study systematically evaluates the impact of neural audio codecs (NACs) on speaker verification (SV) performance, benchmarking them against traditional codecs (e.g., Opus) across multiple bitrates. Using the VoxCeleb1 dataset, we assess three state-of-the-art SV models (ECAPA-TDNN, ResNet34, and RawNet3) under compression by various NACs and Opus at bitrates spanning 2–48 kbps, measuring equal error rate (EER) degradation. SV performance degrades for every model and codec as the bitrate decreases. NACs outperform Opus by 6–8% at low bitrates (<12 kbps) and trail it only marginally near 24 kbps, where their EER is 0.4–0.7% higher, so NACs do not fundamentally break SV. The gap at higher bitrates is attributed, as a likely cause rather than a demonstrated one, to NACs being optimized primarily for perceptual quality, which can inadvertently discard speaker-discriminative acoustic features. To our knowledge, this is the first work to reveal the bitrate-dependent trade-off between perceptual fidelity and speaker identity preservation in NACs. It points to two directions for future work: speaker-aware codec design, and retraining or adapting SV models to codec-processed audio.

📝 Abstract

Neural audio codecs (NACs) have made significant advancements in recent years and are rapidly being adopted in many audio processing pipelines. However, they can introduce audio distortions which degrade speaker verification (SV) performance. This study investigates the impact of both traditional and neural audio codecs at varying bitrates on three state-of-the-art SV models evaluated on the VoxCeleb1 dataset. Our findings reveal a consistent degradation in SV performance across all models and codecs as bitrates decrease. Notably, NACs do not fundamentally break SV performance when compared to traditional codecs. They outperform Opus by 6-8% at low bitrates (< 12 kbps) and remain marginally behind at higher bitrates ($\approx$ 24 kbps), with an EER increase of only 0.4-0.7%. The disparity at higher bitrates is likely due to the primary optimization of NACs for perceptual quality, which can inadvertently discard critical speaker-discriminative features, unlike Opus, which was designed to preserve vocal characteristics. Our investigation suggests that NACs are a feasible alternative to traditional codecs, especially under bandwidth limitations. To bridge the gap at higher bitrates, future work should focus on developing speaker-aware NACs or retraining and adapting SV models.
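
The EER figures quoted above refer to the operating point at which the false acceptance rate (impostor trials accepted) equals the false rejection rate (genuine trials rejected). The following is a minimal sketch of how that value can be approximated from trial scores. It assumes that higher scores mean "more likely the same speaker"; the function name `equal_error_rate` and the toy scores are illustrative and do not come from the paper, whose exact scoring pipeline is not described here.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER from same-speaker and different-speaker trial scores.

    Assumption: a trial is accepted when its score is >= the threshold.
    """
    if not genuine or not impostor:
        raise ValueError("need at least one genuine and one impostor score")

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the "reject everything" operating point is also considered.
    candidates = sorted(set(genuine) | set(impostor))
    candidates.append(candidates[-1] + 1.0)

    best_gap, best_eer = None, None
    for threshold in candidates:
        # False acceptance rate: impostor trials that pass the threshold.
        far = sum(s >= threshold for s in impostor) / len(impostor)
        # False rejection rate: genuine trials that fail the threshold.
        frr = sum(s < threshold for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # With finite data FAR and FRR rarely cross exactly, so report
            # their midpoint at the closest crossing.
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Hypothetical scores for the same trials before and after codec compression.
clean_eer = equal_error_rate([0.9, 0.8, 0.7, 0.6], [0.4, 0.3, 0.2, 0.1])
coded_eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
print(f"EER increase after coding: {coded_eer - clean_eer:.2f}")
```

Comparing the EER of the same trial list scored on clean audio and on codec-processed audio gives the kind of degradation figure reported in the abstract. Production evaluations typically interpolate the ROC curve rather than scanning observed thresholds, so values from this sketch can differ slightly on small trial lists.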

Problem

Research questions and friction points this paper is trying to address.

Impact of neural audio codecs on speaker verification performance

Trade-offs between codec bitrates and verification accuracy

Comparison of neural and traditional codecs for speaker recognition

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated neural audio codecs on speaker verification

Compared NACs with traditional codecs at varying bitrates

Proposed speaker-aware NACs, or retraining and adapting SV models, as future work to close the gap at higher bitrates
