WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing speech representations struggle to support understanding and generation at the same time: self-supervised learning (SSL) features are strongly semantic but contain off-manifold redundancy that makes them intractable for diffusion-based generation, while acoustic features learned through reconstruction are not semantically oriented. To address this, the work proposes WavCube, a compact, continuous latent representation derived from an SSL speech encoder that jointly models semantic and acoustic information through two-stage training. Stage 1 trains a semantic bottleneck that filters out the redundancy; Stage 2 injects fine-grained acoustic detail via end-to-end reconstruction, with a semantic anchoring loss keeping the representation within its original semantic manifold. The resulting representation simultaneously supports speech understanding, reconstruction, and generation. Experiments show that, despite 8× dimensional compression, WavCube closely approaches WavLM's performance on SUPERB, reaches reconstruction quality on par with existing acoustic representations, attains state-of-the-art zero-shot text-to-speech (TTS) performance with markedly faster training convergence, and performs strongly on the speech enhancement, separation, and voice conversion tasks of SUPERB-SG.

📝 Abstract

Integrating speech understanding and generation is a pivotal step toward building unified speech models. However, the different representations required for these two tasks currently pose significant compatibility challenges. Typically, semantics-oriented features are learned from self-supervised learning (SSL), and acoustic-oriented features from reconstruction. Such fragmented representations hinder the realization of truly unified speech systems. We present WavCube, a compact continuous latent derived from an SSL speech encoder that simultaneously supports speech understanding, reconstruction, and generation. WavCube employs a two-stage training scheme. Stage 1 trains a semantic bottleneck to filter off-manifold redundancy that makes raw SSL features intractable for diffusion. Stage 2 injects fine-grained acoustic details via end-to-end reconstruction, while a semantic anchoring loss ensures the representation remains grounded within its original semantic manifold. Comprehensive experiments show that WavCube closely approaches WavLM performance on SUPERB despite an 8x dimensional compression, attains reconstruction quality on par with existing acoustic representations, delivers state-of-the-art zero-shot TTS performance with markedly faster training convergence, and excels in speech enhancement, separation, and voice conversion tasks on the SUPERB-SG benchmark. Systematic ablations reveal that WavCube's two-stage recipe resolves two intrinsic flaws of SSL features for generative modeling, paving the way for future unified speech systems. Code and checkpoints are available at https://github.com/yanghaha0908/WavCube.
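
The Stage 2 objective described in the abstract can be pictured as a reconstruction term plus a weighted semantic anchoring term. The sketch below is illustrative only and is not taken from the paper or its repository: the function names, the use of mean squared error for both terms, the choice of the Stage 1 latent as the anchoring target, and the `anchor_weight` parameter are all assumptions. The 1024-dimensional input in the usage lines is likewise a hypothetical SSL feature width, chosen only to show what an 8x dimensional compression would mean.

```python
def mse(a, b):
    """Mean squared error between two equal-length lists of floats."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def compressed_dim(ssl_dim, factor=8):
    """Latent width after dimensional compression by `factor` (hypothetical helper)."""
    if ssl_dim % factor != 0:
        raise ValueError("ssl_dim must be divisible by factor")
    return ssl_dim // factor


def stage2_loss(recon_wave, target_wave, latent, stage1_latent, anchor_weight=1.0):
    """Assumed form of a Stage 2 objective: reconstruction + semantic anchoring.

    recon_wave / target_wave: decoded and reference signals (flat lists here).
    latent: current latent; stage1_latent: the Stage 1 bottleneck output,
    used as the anchoring target (an assumption, not confirmed by the source).
    """
    recon_term = mse(recon_wave, target_wave)   # acoustic fidelity
    anchor_term = mse(latent, stage1_latent)    # stay near the semantic latent
    return recon_term + anchor_weight * anchor_term


# Usage with toy values.
print(compressed_dim(1024))  # 128
print(stage2_loss([0.0, 0.0], [1.0, 1.0], [0.5, 0.5], [0.5, 0.5]))  # 1.0
```

With a larger `anchor_weight`, drift of the latent away from its Stage 1 value is penalized more heavily relative to reconstruction error, which is the trade-off the anchoring loss is described as controlling.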

Problem

Research questions and friction points this paper is trying to address.

speech representation

unified speech models

semantic-acoustic modeling

self-supervised learning

speech understanding and generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic-acoustic joint modeling

unified speech representation

two-stage training