Learning Neural Vocoder from Range-Null Space Decomposition

📅 2025-07-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Neural vocoders face challenges including opaque modeling mechanisms and difficulty balancing parameter efficiency with synthesis quality. To address these, we propose the first time-frequency-domain neural vocoder grounded in range-null space decomposition theory, which decouples spectral reconstruction into a linear mapping in the range space (preserving core structural fidelity) and detail generation in the null space (capturing high-frequency components). Methodologically, we design a dual-path hierarchical encoder-decoder architecture integrating cross-band/narrow-band sub-band modeling, a mel-to-linear spectrogram domain-transfer network, and a learnable null-space completion module. Evaluated on LJSpeech and LibriTTS, our model achieves state-of-the-art reconstruction quality with a significantly reduced parameter count, while simultaneously improving speech naturalness and fidelity. The implementation is publicly available.

📝 Abstract
Despite the rapid development of neural vocoders in recent years, they typically suffer from intrinsic challenges such as opaque modeling and a parameter-performance trade-off. In this study, we propose an innovative time-frequency (T-F) domain neural vocoder to resolve these challenges. Specifically, we bridge classical signal range-null decomposition (RND) theory and the vocoder task: the reconstruction of the target spectrogram is decomposed into the superposition of a range-space component and a null-space component, where the former is obtained via a linear domain shift from the original mel-scale domain to the target linear-scale domain, and the latter is instantiated by a learnable network for further spectral detail generation. Accordingly, we propose a novel dual-path framework in which the spectrum is hierarchically encoded and decoded, with cross- and narrow-band modules elaborately devised for efficient sub-band and sequential modeling. Comprehensive experiments are conducted on the LJSpeech and LibriTTS benchmarks. Quantitative and qualitative results show that, while enjoying a lightweight parameter count, the proposed approach yields state-of-the-art performance among existing advanced methods. Our code and pretrained model weights are available at https://github.com/Andong-Li-speech/RNDVoC.
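The range-null decomposition the abstract describes can be illustrated with plain linear algebra: any reconstruction consistent with a mel-scale observation splits into a range-space part fixed by a linear map and a null-space part the model is free to generate. The NumPy sketch below uses a random matrix and toy dimensions as stand-ins for the mel filterbank; it is an illustration of the general RND identity, not the paper's actual implementation.

```python
import numpy as np

# Toy sizes: a "mel" analysis matrix A maps a linear-scale spectrum
# (dim 16) down to a mel-scale one (dim 8). A is random here, purely
# for illustration; in a vocoder it would be the mel filterbank.
rng = np.random.default_rng(0)
n_linear, n_mel = 16, 8
A = rng.random((n_mel, n_linear))
x_true = rng.random(n_linear)        # target linear-scale spectrum
y = A @ x_true                       # observed mel-scale spectrum

A_pinv = np.linalg.pinv(A)           # Moore-Penrose pseudo-inverse
P_range = A_pinv @ A                 # projector onto the range space
P_null = np.eye(n_linear) - P_range  # projector onto the null space

# Range-space part: fully determined by the mel spectrum via a linear map.
x_range = A_pinv @ y
# Null-space part: invisible to A, so it must be *generated*. Any vector v
# works here; in the paper's framework a learned network supplies it.
v = rng.random(n_linear)
x_hat = x_range + P_null @ v

# Whatever v is, the reconstruction stays consistent with the observation.
print(np.allclose(A @ x_hat, y))     # True
```

This is why the null-space term can be handed to a learnable module without risking fidelity to the input: the range-space projection already pins down everything the mel spectrogram determines.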
Problem

Research questions and friction points this paper is trying to address.

Resolving opaque modeling in neural vocoders
Addressing parameter-performance trade-off in vocoders
Enhancing spectral detail generation in vocoders
Innovation

Methods, ideas, or system contributions that make the work stand out.

Time-frequency domain-based neural vocoder
Range-null decomposition theory integration
Dual-path hierarchical encoding framework
Andong Li
Institute of Acoustics, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Tong Lei
Tencent AI Lab; Nanjing University
Zhihang Sun
Tencent AI Lab
Rilin Chen
Tencent AI Lab
Erwei Yin
Defense Innovation Institute, Academy of Military Sciences (AMS); Tianjin Artificial Intelligence Innovation Center (TAIIC)
Xiaodong Li
Institute of Acoustics, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Chengshi Zheng
Institute of Acoustics, Chinese Academy of Sciences
Speech enhancement · Microphone array · Deep learning