Towards General Discrete Speech Codec for Complex Acoustic Environments: A Study of Reconstruction and Downstream Task Consistency

📅 2025-05-28
🤖 AI Summary
Problem: Existing neural speech codecs lack systematic evaluation of reconstruction robustness and downstream task consistency (e.g., ASR, speech enhancement) under challenging acoustic conditions such as noise and reverberation. Method: The authors introduce ERSB, the Environment-Resilient Speech Codec Benchmark, which evaluates codecs along two axes: reconstruction fidelity and downstream task consistency. The framework incorporates multi-condition acoustic degradation simulation, fine-grained perceptual and spectral reconstruction assessment, and end-to-end consistency quantification. Contribution/Results: Experiments reveal substantial degradation in reconstruction quality for state-of-the-art codecs under realistic distortions, with downstream performance deviations exceeding 30%, exposing limitations of current designs. ERSB provides a reproducible benchmark, a principled evaluation framework, and directions for developing more environment-resilient speech codecs. The work argues for rigorous, task-aware codec evaluation in non-ideal acoustic environments.
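
The "multi-condition acoustic degradation simulation" mentioned in the summary is not specified further on this page. As a rough illustration only, one common way to simulate a noisy condition is to mix a noise recording into clean speech at a chosen signal-to-noise ratio (SNR). The function names `mix_at_snr` and `measured_snr_db` below are hypothetical and are not taken from the paper; signals are plain Python lists of samples.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power matches `snr_db`, then add it.

    Illustrative sketch only; the benchmark's actual degradation pipeline may differ.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("signals must have non-zero power")
    # From SNR_dB = 10 * log10(P_speech / P_noise), solve for the target noise power.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, noisy):
    """Recover the SNR of `noisy` relative to the clean `speech` it was built from."""
    residual = [y - s for s, y in zip(speech, noisy)]
    speech_power = sum(s * s for s in speech) / len(speech)
    residual_power = sum(r * r for r in residual) / len(residual)
    return 10 * math.log10(speech_power / residual_power)


if __name__ == "__main__":
    clean = [math.sin(0.1 * i) for i in range(1000)]      # stand-in for speech
    hum = [0.3 * (-1) ** i for i in range(1000)]          # stand-in for noise
    noisy = mix_at_snr(clean, hum, snr_db=5.0)
    print(round(measured_snr_db(clean, noisy), 3))
```

Sweeping `snr_db` over several values would give a simple family of noisy conditions; reverberation would need a separate step (for example, convolution with a room impulse response), which this sketch does not cover.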

📝 Abstract
Neural speech codecs excel in reconstructing clean speech signals; however, their efficacy in complex acoustic environments and downstream signal processing tasks remains underexplored. In this study, we introduce a novel benchmark named Environment-Resilient Speech Codec Benchmark (ERSB) to systematically evaluate whether neural speech codecs are environment-resilient. Specifically, we assess two key capabilities: (1) robust reconstruction, which measures the preservation of both speech and non-speech acoustic details, and (2) downstream task consistency, which ensures minimal deviation in downstream signal processing tasks when using reconstructed speech instead of the original. Our comprehensive experiments reveal that complex acoustic environments significantly degrade signal reconstruction and downstream task consistency. This work highlights the limitations of current speech codecs and raises a future direction that improves them for greater environmental resilience.
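
To make "downstream task consistency" concrete, here is a minimal sketch for the ASR case. It assumes the same recognizer has been run on the original audio and on the codec-reconstructed audio, and it scores the deviation as a word-level edit distance between the two transcripts, normalized by the length of the original-audio transcript. This choice of metric and the name `transcript_deviation` are assumptions for illustration; the abstract does not state which consistency measures ERSB uses.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance over word sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def transcript_deviation(original_transcript, reconstructed_transcript):
    """Deviation of the ASR output on reconstructed audio from the output on original audio.

    The transcript of the ORIGINAL audio is treated as the reference, so 0.0 means the
    codec left the downstream result unchanged (hypothetical metric, not the paper's).
    """
    ref = original_transcript.lower().split()
    hyp = reconstructed_transcript.lower().split()
    if not ref:
        raise ValueError("original transcript must contain at least one word")
    return word_edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts; in practice these would come from an ASR system.
    print(transcript_deviation("turn the lights off", "turn the light off"))
```

Using the original-audio output as the reference, instead of a ground-truth label, isolates the change introduced by the codec from errors the downstream model would make anyway.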
Problem

Research questions and friction points this paper is trying to address.

Assessing neural speech codecs in complex acoustic environments
Evaluating reconstruction robustness and downstream task consistency
Improving environmental resilience of current speech codecs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Environment-Resilient Speech Codec Benchmark
Assesses robust reconstruction and task consistency
Highlights limitations of current speech codecs
Haoran Wang
MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing, X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China
Guanyu Chen
MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing, X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China
Bohan Li
MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing, X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China
Hankun Wang
Shanghai Jiao Tong University
Speech Synthesis
Yiwei Guo
MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing, X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China
Zhihan Li
Kuaishou Technology, Tsinghua University
Anomaly Detection, AIOps
Xie Chen
MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing, X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China
Kai Yu
MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing, X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China