Unlocking Temporal Flexibility: Neural Speech Codec with Variable Frame Rate

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Most neural speech codecs employ a constant frame rate (CFR) and therefore cannot adapt to temporal variations in speech information density (e.g., between silent and voiced segments). This leads to bitrate inefficiency and unnecessarily long token sequences, which hinders real-time performance. To address this, we propose Temporally Flexible Coding (TFC), the first neural speech codec framework to incorporate variable frame rate (VFR) encoding. TFC dynamically adjusts the frame rate based on temporal entropy estimation, and introduces a dedicated neural vocoder architecture, frame-level codebook quantization, and a differentiable sampling mechanism, enabling seamless, information-driven frame allocation with a continuously adjustable average frame rate. Experiments show that TFC significantly shortens token sequences while preserving high reconstruction quality, and it remains competitive even at low frame rates. TFC establishes a new paradigm for efficient, real-time speech coding and benefits downstream speech processing tasks.
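The summary describes allocating frames by temporal entropy: spend more frames where the signal is information-dense (voiced regions) and fewer where it is not (silence). The paper's actual estimator and differentiable sampler are not reproduced here; the sketch below is a hypothetical, non-differentiable illustration of the core idea, using spectral entropy per frame and greedy merging of low-entropy neighbours to hit a target average frame rate. The function names and the greedy merge rule are assumptions for illustration only.

```python
import numpy as np

def frame_entropy(mag_frames: np.ndarray) -> np.ndarray:
    """Spectral entropy per frame: treat each magnitude spectrum
    (shape [n_frames, n_bins]) as a distribution over frequency bins."""
    p = mag_frames / (mag_frames.sum(axis=1, keepdims=True) + 1e-8)
    return -(p * np.log(p + 1e-8)).sum(axis=1)

def allocate_frames(entropy: np.ndarray, target_ratio: float) -> list:
    """Greedily merge consecutive low-entropy frames until only
    target_ratio of the original frame count remains.
    Returns contiguous (start, end) spans, one per output frame."""
    n_out = max(1, int(round(len(entropy) * target_ratio)))
    spans = [(i, i + 1) for i in range(len(entropy))]
    cost = list(entropy)
    while len(spans) > n_out:
        # merge the neighbour pair whose combined entropy is smallest,
        # i.e. sacrifice temporal resolution where information is lowest
        i = min(range(len(spans) - 1), key=lambda k: cost[k] + cost[k + 1])
        spans[i] = (spans[i][0], spans[i + 1][1])
        cost[i] = cost[i] + cost[i + 1]
        del spans[i + 1], cost[i + 1]
    return spans
```

With `target_ratio=0.5` the output has half as many frames as the input, and silent stretches (near-flat, low-entropy spectra) are collapsed into single spans while voiced regions keep finer resolution. In the actual codec this hard merge would be replaced by the paper's differentiable sampling mechanism so the allocation can be trained end to end.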

📝 Abstract
Most neural speech codecs achieve bitrate adjustment through intra-frame mechanisms, such as codebook dropout, at a constant frame rate (CFR). However, speech segments inherently have time-varying information density (e.g., silent intervals versus voiced regions). This property makes CFR suboptimal in terms of bitrate and token sequence length, hindering efficiency in real-time applications. In this work, we propose a Temporally Flexible Coding (TFC) technique, introducing variable frame rate (VFR) into neural speech codecs for the first time. TFC enables seamlessly tunable average frame rates and dynamically allocates frame rates based on temporal entropy. Experimental results show that a codec with TFC achieves optimal reconstruction quality with high flexibility and maintains competitive performance even at lower frame rates. Our approach is promising for integration with other efforts to develop low-frame-rate neural speech codecs for more efficient downstream tasks.
Problem

Research questions and friction points this paper is trying to address.

Neural speech codecs lack temporal flexibility for varying speech densities
Constant frame rate hinders bitrate and sequence length efficiency
Dynamic frame rate allocation needed for optimal reconstruction quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces variable frame rate (VFR) coding
Dynamically allocates frame rates by entropy
Maintains quality at lower frame rates
Hanglei Zhang
Shanghai Jiao Tong University
Yiwei Guo
MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing, X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China
Zhihan Li
Kuaishou Technology, Tsinghua University
Xiang Hao
Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University, China
Xie Chen
MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing, X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China
Kai Yu
MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing, X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China