🤖 AI Summary
This paper addresses the trade-off between efficiency and quality in neural speech coding under extremely low-resource constraints (<700 MFLOPs of computation, <30 ms latency, dual-rate support at 1 kbps and 6 kbps) and proposes PhoenixCodec, an efficient neural speech codec framework. It introduces: (1) an optimized asymmetric frequency-time encoder-decoder architecture that alleviates the resource scattering of conventional decoders; (2) a Cyclical Calibration and Refinement (CCR) training strategy that helps training escape local optima; and (3) noise-invariant fine-tuning on noisy samples to improve robustness in challenging acoustic conditions (e.g., real-world noise and reverberation). In Track 1 of the LRAC 2025 Challenge, the framework ranked third overall and, at 1 kbps, achieved the best performance both under real-world noise and reverberation and in clean-speech intelligibility tests.
📝 Abstract
This paper presents PhoenixCodec, a comprehensive neural speech coding and decoding framework designed for extremely low-resource conditions. The proposed system integrates an optimized asymmetric frequency-time architecture, a Cyclical Calibration and Refinement (CCR) training strategy, and a noise-invariant fine-tuning procedure. Under stringent constraints (computation below 700 MFLOPs, latency below 30 ms, and dual-rate support at 1 kbps and 6 kbps), existing methods face a trade-off between efficiency and quality. PhoenixCodec addresses this trade-off by alleviating the resource scattering of conventional decoders, employing CCR to escape local optima, and enhancing robustness through noisy-sample fine-tuning. In Track 1 of the LRAC 2025 Challenge, the proposed system ranked third overall and, at 1 kbps, achieved the best performance both under real-world noise and reverberation and in clean-speech intelligibility tests, confirming its effectiveness.
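
To make the dual-rate constraint concrete, the sketch below computes how many bits a codec may spend per frame at each target bitrate. The frame durations (10 ms and 20 ms) are illustrative assumptions, not values taken from the paper, and `bits_per_frame` is a hypothetical helper introduced only for this example.

```python
def bits_per_frame(bitrate_bps: int, frame_ms: float) -> float:
    """Bit budget available for one frame at a constant bitrate."""
    if bitrate_bps <= 0 or frame_ms <= 0:
        raise ValueError("bitrate and frame duration must be positive")
    return bitrate_bps * frame_ms / 1000.0


# The two target bitrates named in the abstract (1 kbps and 6 kbps).
TARGET_BITRATES_BPS = (1_000, 6_000)

# Hypothetical frame durations; the abstract does not state the frame size.
ASSUMED_FRAME_MS = (10.0, 20.0)

if __name__ == "__main__":
    for bitrate in TARGET_BITRATES_BPS:
        for frame_ms in ASSUMED_FRAME_MS:
            budget = bits_per_frame(bitrate, frame_ms)
            print(f"{bitrate} bps, {frame_ms} ms frame: {budget:.0f} bits")
```

Under the assumed 20 ms frame, the 1 kbps mode leaves 20 bits per frame and the 6 kbps mode leaves 120 bits, which gives a rough sense of how tight the lower operating point is.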