Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine

📅 2025-07-16
🤖 AI Summary
This work addresses the challenge of ultra-low-bitrate audio coding for machine perception. We propose a task-oriented residual vector quantization (RVQ) method to compress and quantize intermediate feature representations from pretrained speech/audio models. Unlike conventional paradigms optimized for perceptual fidelity, our approach directly incorporates downstream-task-specific losses—such as ASR word error rate or audio classification accuracy—as optimization objectives for quantization, enabling joint bitrate–performance optimization. The framework supports multi-bitrate adaptation and cross-model-scale transferability. Evaluated on automatic speech recognition and audio classification tasks, it achieves bitrates below 200 bps while retaining over 99% of the original model’s performance—significantly outperforming state-of-the-art neural audio codecs.

📝 Abstract
Neural audio codecs, leveraging quantization algorithms, have significantly impacted various speech/audio tasks. While high-fidelity reconstruction is paramount for human perception, audio coding for machines (ACoM) prioritizes efficient compression and downstream task performance, disregarding perceptual nuances. This work introduces an efficient ACoM method that can compress and quantize any chosen intermediate feature representation of an already trained speech/audio downstream model. Our approach employs task-specific loss guidance alongside residual vector quantization (RVQ) losses, providing ultra-low bitrates (i.e., less than 200 bps) with a minimal loss of the downstream model performance. The resulting tokenizer is adaptable to various bitrates and model sizes for flexible deployment. Evaluated on automatic speech recognition and audio classification, our method demonstrates its efficacy and potential for broader task and architectural applicability through appropriate regularization.
Problem

Research questions and friction points this paper is trying to address.

Efficient audio compression for machine tasks, not human perception
Minimizing performance loss in downstream models at ultra-low bitrates
Adaptable tokenizer for various bitrates and model sizes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Task-specific loss guidance with RVQ
Ultra-low bitrates under 200 bps
Adaptable tokenizer for various deployments
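To make the RVQ idea concrete, here is a minimal toy sketch of residual vector quantization: each stage picks the nearest code to the residual left by the previous stage, and the decoder sums the selected codes. This is a generic illustration with made-up shapes and random codebooks (`rvq_encode`, `rvq_decode`, 3 stages of 8 codes are all assumptions), not the paper's implementation — in particular, it omits the task-specific loss guidance used to train the codebooks.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize x stage by stage: each codebook encodes the residual
    left over from the previous stage."""
    residual = np.asarray(x, dtype=float)
    indices = []
    for cb in codebooks:
        # nearest code (Euclidean distance) to the current residual
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)
        residual = residual - cb[idx]  # pass the remainder to the next stage
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruction is the sum of the selected code from each stage."""
    return sum(cb[i] for i, cb in zip(indices, codebooks))

rng = np.random.default_rng(0)
# 3 stages of 8 codes each -> 3 * log2(8) = 9 bits per feature vector
codebooks = [rng.normal(size=(8, 4)) for _ in range(3)]
x = rng.normal(size=4)
codes = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
print(codes, np.linalg.norm(x - x_hat))
```

The bitrate lever is visible here: stages times bits-per-codebook gives the bits per quantized vector, which is how an RVQ tokenizer can be adapted to different bitrate budgets by using more or fewer stages.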
Anastasia Kuznetsova
PhD, Computer Science, Indiana University
Speech and Audio processing
Inseon Jang
Electronics and Telecommunications Research Institute
audio signal processing, audio coding
Wootaek Lim
Electronics and Telecommunications Research Institute, Daejeon, Korea
Minje Kim
University of Illinois Urbana-Champaign, Champaign, IL, USA