VoiceBridge: Designing Latent Bridge Models for General Speech Restoration at Scale

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing bridge models for speech enhancement are typically constrained to single-task settings or small-scale datasets, limiting their applicability to large-scale general speech restoration (GSR). To address this, we propose VoiceBridge, a latent bridge model (LBM) framework, the first to enable unified, high-fidelity, full-band speech reconstruction under diverse degradations (e.g., noise, reverberation, bandwidth limitation). Methodologically, we design an energy-preserving variational autoencoder coupled with a joint neural prior to model a continuous latent space, adopt a scalable Transformer architecture, and introduce a perception-driven fine-tuning strategy. Our approach significantly improves speech intelligibility and naturalness on both in-domain and out-of-domain tasks, supports zero-shot generalization, and demonstrates strong robustness and practical utility in real-world scenarios, including podcast enhancement and synthetic speech optimization.

📝 Abstract
Bridge models have recently been explored for speech enhancement tasks such as denoising, dereverberation, and super-resolution, but these efforts are typically confined to a single task or small-scale datasets, with constrained general speech restoration (GSR) capability at scale. In this work, we introduce VoiceBridge, a GSR system rooted in latent bridge models (LBMs), capable of reconstructing high-fidelity speech at full band (i.e., 48 kHz) from various distortions. By compressing the speech waveform into continuous latent representations, VoiceBridge models the diverse LQ-to-HQ tasks (namely, low-quality to high-quality) in GSR with a single latent-to-latent generative process backed by a scalable Transformer architecture. To better inherit the advantages of bridge models from the data domain to the latent space, we present an energy-preserving variational autoencoder, enhancing the alignment between the waveform and latent space over varying energy levels. Furthermore, to address the difficulty of HQ reconstruction from distinctively different LQ priors, we propose a joint neural prior, uniformly alleviating the reconstruction burden of the LBM. Finally, considering a key requirement of GSR systems, human perceptual quality, a perceptually aware fine-tuning stage is designed to mitigate the cascading mismatch in generation while improving perceptual alignment. Extensive validation across in-domain and out-of-domain tasks and datasets (e.g., refining recent zero-shot speech and podcast generation results) demonstrates the superior performance of VoiceBridge. Demo samples can be visited at: https://VoiceBridge-demo.github.io/.
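The abstract's central idea, a single latent-to-latent generative process that starts from an LQ latent prior and walks toward an HQ latent, can be illustrated with a toy bridge sampler. Everything below is a hedged sketch, not the paper's implementation: the function names, the Euler discretization, and the Brownian-bridge-style noise schedule are all illustrative assumptions, and `predict_z_hq` is a stand-in for the learned Transformer backbone.

```python
import numpy as np

def sample_latent_bridge(z_lq, predict_z_hq, n_steps=50, sigma=0.1, rng=None):
    """Toy Euler sampler for a Brownian-bridge-style latent process.

    Starts at the low-quality latent z_lq (the bridge's prior endpoint,
    t=1) and walks toward the model's current estimate of the
    high-quality latent (t=0). `predict_z_hq(z_t, t)` stands in for the
    network; here it is any callable returning an HQ-latent estimate.
    """
    rng = rng or np.random.default_rng(0)
    z = z_lq.copy()
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        z_hat = predict_z_hq(z, t)          # network's HQ estimate at time t
        dt = t - t_next
        drift = (z_hat - z) / max(t, 1e-8)  # pull toward the HQ endpoint
        # Bridge-style noise that vanishes at the final step (t_next -> 0).
        noise = sigma * np.sqrt(dt * t_next / max(t, 1e-8)) * rng.standard_normal(z.shape)
        z = z + drift * dt + noise
    return z
```

With a predictor that always returns the same target latent, the sampler lands exactly on that target at the final step, since the drift closes the remaining gap and the bridge noise vanishes at t=0.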
Problem

Research questions and friction points this paper is trying to address.

Developing a general speech restoration system that handles multiple distortion types
Modeling diverse low-to-high-quality tasks with a single generative process
Enhancing perceptual quality of speech reconstructed from various distortions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent bridge models for general speech restoration
Energy-preserving variational autoencoder for latent alignment
Perceptually aware fine-tuning for human perceptual quality
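The "energy-preserving" idea in the second innovation bullet can be sketched as a level-robust encode/decode wrapper: factor out the waveform's RMS gain before encoding and reapply it after decoding, so latents stay aligned across energy levels. This is a hypothetical illustration under assumed helper names and a plain RMS normalization, not the paper's actual VAE mechanism:

```python
import numpy as np

def encode_level_robust(wav, encode, eps=1e-8):
    # Factor out RMS energy so the encoder sees a level-invariant signal;
    # return the gain so it can be restored after decoding.
    gain = float(np.sqrt(np.mean(wav ** 2)) + eps)
    return encode(wav / gain), gain

def decode_level_robust(z, gain, decode):
    # Reapply the stored gain so the output energy matches the input's.
    return decode(z) * gain
```

With identity encode/decode functions, the round trip returns the input waveform unchanged, i.e. energy is preserved end to end regardless of the input level.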
Chi Zhang
Tsinghua University
Zehua Chen
PostDoc at Tsinghua University | Ph.D. from Imperial College
Generative Models · Multi-modal Generation · Health Monitoring
Kaiwen Zheng
Tsinghua University
Jun Zhu
Tsinghua University