Unlocking Temporal Flexibility: Neural Speech Codec with Variable Frame Rate

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Most neural speech codecs employ a constant frame rate (CFR) and therefore cannot adapt to temporal variations in speech information density (e.g., between silent and voiced segments). This leads to bitrate inefficiency and unnecessarily long token sequences, which hinders real-time performance. To address this, we propose Temporally Flexible Coding (TFC), the first neural speech codec framework to incorporate variable frame rate (VFR) encoding. TFC dynamically adjusts the frame rate based on temporal entropy estimation, and introduces a dedicated neural vocoder architecture, frame-level codebook quantization, and a differentiable sampling mechanism, enabling seamless, information-driven frame allocation with a continuously adjustable average frame rate. Experiments show that TFC significantly shortens token sequences while preserving high reconstruction quality, and it remains competitive even at low frame rates. TFC establishes a new paradigm for efficient, real-time speech coding and benefits downstream speech processing tasks.
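The summary describes allocating frames by temporal entropy: spend more frames where the signal is information-dense (voiced regions) and fewer where it is not (silence). The paper's actual estimator and differentiable sampler are not reproduced here; the sketch below is a hypothetical, non-differentiable illustration of the core idea, using spectral entropy per frame and greedy merging of low-entropy neighbours to hit a target average frame rate. The function names and the greedy merge rule are assumptions for illustration only.

```python
import numpy as np

def frame_entropy(mag_frames: np.ndarray) -> np.ndarray:
    """Spectral entropy per frame: treat each magnitude spectrum
    (shape [n_frames, n_bins]) as a distribution over frequency bins."""
    p = mag_frames / (mag_frames.sum(axis=1, keepdims=True) + 1e-8)
    return -(p * np.log(p + 1e-8)).sum(axis=1)

def allocate_frames(entropy: np.ndarray, target_ratio: float) -> list:
    """Greedily merge consecutive low-entropy frames until only
    target_ratio of the original frame count remains.
    Returns contiguous (start, end) spans, one per output frame."""
    n_out = max(1, int(round(len(entropy) * target_ratio)))
    spans = [(i, i + 1) for i in range(len(entropy))]
    cost = list(entropy)
    while len(spans) > n_out:
        # merge the neighbour pair whose combined entropy is smallest,
        # i.e. sacrifice temporal resolution where information is lowest
        i = min(range(len(spans) - 1), key=lambda k: cost[k] + cost[k + 1])
        spans[i] = (spans[i][0], spans[i + 1][1])
        cost[i] = cost[i] + cost[i + 1]
        del spans[i + 1], cost[i + 1]
    return spans
```

With `target_ratio=0.5` the output has half as many frames as the input, and silent stretches (near-flat, low-entropy spectra) are collapsed into single spans while voiced regions keep finer resolution. In the actual codec this hard merge would be replaced by the paper's differentiable sampling mechanism so the allocation can be trained end to end.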

📝 Abstract
Most neural speech codecs achieve bitrate adjustment through intra-frame mechanisms, such as codebook dropout, at a constant frame rate (CFR). However, speech segments inherently have time-varying information density (e.g., silent intervals versus voiced regions). This property makes CFR suboptimal in terms of bitrate and token sequence length, hindering efficiency in real-time applications. In this work, we propose a Temporally Flexible Coding (TFC) technique, introducing variable frame rate (VFR) into neural speech codecs for the first time. TFC enables seamlessly tunable average frame rates and dynamically allocates frame rates based on temporal entropy. Experimental results show that a codec with TFC achieves optimal reconstruction quality with high flexibility and maintains competitive performance even at lower frame rates. Our approach is promising for integration with other efforts to develop low-frame-rate neural speech codecs for more efficient downstream tasks.
Problem

Research questions and friction points this paper is trying to address.

Neural speech codecs lack temporal flexibility for varying speech densities
Constant frame rate hinders bitrate and sequence length efficiency
Dynamic frame rate allocation needed for optimal reconstruction quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces variable frame rate (VFR) coding
Dynamically allocates frame rates by entropy
Maintains quality at lower frame rates
Hanglei Zhang
Shanghai Jiao Tong University
Yiwei Guo
MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing, X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China
Zhihan Li
Kuaishou Technology, Tsinghua University
Xiang Hao
Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University, China
Xie Chen
MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing, X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China
Kai Yu
MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing, X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China