🤖 AI Summary
To achieve both low latency and high fidelity for neural vocoders in CPU-constrained real-time speech synthesis, this paper proposes MS-Wavehax, a multi-stream decomposition architecture. We systematically identify and overcome the key bottlenecks of streaming causal inference on CPU: limited parallelism, inter-frame dependency coupling, and parameter-loading overhead. Our approach introduces four core techniques (multi-stream causal decomposition, dynamic block-size optimization, lightweight parameter loading, and inter-frame dependency decoupling) while preserving Wavehax's alias-free property. Experiments demonstrate that MS-Wavehax achieves a sub-3 MB model size, millisecond-level end-to-end inference latency on edge devices, and a mean opinion score (MOS) ≥ 4.1, matching the audio quality of non-causal models. The method attains Pareto-optimal trade-offs between throughput and latency.
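The core idea of multi-stream decomposition can be sketched as follows: instead of generating one sample per step, the network emits S parallel streams at 1/S of the output rate, and a recombination step merges them into the full-rate waveform, so each network step covers S samples. The paper does not specify the recombination here; the sketch below uses simple interleaving as a stand-in (filterbank-based synthesis such as PQMF is also common in multi-band vocoders), and all shapes are illustrative assumptions.

```python
import numpy as np

def combine_streams(streams: np.ndarray) -> np.ndarray:
    """Merge S parallel streams of shape (S, T) into a waveform of length S*T.

    Interleaving places streams[s, t] at output position t*S + s, i.e. each
    network step (one column of `streams`) yields S consecutive samples.
    This is an illustrative stand-in, not MS-Wavehax's actual recombination.
    """
    S, T = streams.shape
    # Transpose to (T, S) so that row-major flattening interleaves streams.
    return streams.T.reshape(S * T)

# Example: 4 streams of 3 low-rate samples each -> 12 full-rate samples.
streams = np.arange(12).reshape(4, 3)
waveform = combine_streams(streams)
```

Because the network only runs T steps to produce S*T samples, per-sample compute and the number of sequential (causal) steps both drop by roughly a factor of S, which is what makes the decomposition attractive for CPU streaming.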
📝 Abstract
In real-time speech synthesis, neural vocoders often require low-latency synthesis through causal processing and streaming. However, streaming introduces inefficiencies absent in batch synthesis, such as limited parallelism, inter-frame dependency management, and parameter loading overhead. This paper proposes multi-stream Wavehax (MS-Wavehax), an efficient neural vocoder for low-latency streaming, by extending the aliasing-free neural vocoder Wavehax with multi-stream decomposition. We analyze the latency-throughput trade-off in a CPU-only environment and identify key bottlenecks in streaming neural vocoders. Our findings provide practical insights for optimizing chunk sizes and designing vocoders tailored to specific application demands and hardware constraints. Furthermore, our subjective evaluations show that MS-Wavehax delivers high speech quality under causal and non-causal conditions while being remarkably compact and easily deployable in resource-constrained environments.
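The latency-throughput trade-off the abstract analyzes can be illustrated with a toy cost model: each chunk pays a fixed per-chunk overhead (e.g. parameter loading and dispatch) plus a per-frame compute cost, so larger chunks amortize overhead (better throughput) but wait longer for input (worse latency). The hop size, sample rate, and cost constants below are hypothetical placeholders, not measurements from the paper.

```python
def chunk_latency_ms(chunk_frames: int, hop_size: int = 240,
                     sample_rate: int = 24000,
                     overhead_ms: float = 2.0,
                     per_frame_ms: float = 0.5) -> float:
    """End-to-end latency for one chunk: time spent buffering the chunk's
    input frames plus a simple compute model (fixed per-chunk overhead plus
    a per-frame cost). All constants are illustrative assumptions."""
    buffering_ms = 1000.0 * chunk_frames * hop_size / sample_rate
    compute_ms = overhead_ms + per_frame_ms * chunk_frames
    return buffering_ms + compute_ms

def compute_ms_per_frame(chunk_frames: int, overhead_ms: float = 2.0,
                         per_frame_ms: float = 0.5) -> float:
    """Inverse throughput: per-chunk compute amortized over its frames."""
    return (overhead_ms + per_frame_ms * chunk_frames) / chunk_frames
```

With these numbers, a 1-frame chunk costs 12.5 ms of latency at 2.5 ms of compute per frame, while an 8-frame chunk raises latency to 86 ms but amortizes compute down to 0.75 ms per frame, which is the shape of the trade-off the paper optimizes when choosing chunk sizes for a given application and hardware budget.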