SemanticVocoder: Bridging Audio Generation and Audio Understanding via Semantic Latents

📅 2026-02-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes SemanticVocoder, a generative vocoder that synthesizes waveforms directly from latents produced by a semantic encoder. It replaces the variational autoencoder (VAE) latents used by most recent audio generation models, which encode low-level acoustic detail rather than semantically discriminative information. Because the same semantic latents can serve both audio understanding and audio generation, the approach is also a step toward a shared semantic space for the two tasks. On the AudioCaps test set, the text-to-audio model built on SemanticVocoder achieves a Fréchet Distance of 12.823 and a Fréchet Audio Distance of 1.709, which the authors attribute to the superior discriminability of semantic latents over acoustic VAE latents.

📝 Abstract

Recent audio generation models typically rely on Variational Autoencoders (VAEs) and perform generation within the VAE latent space. Although VAEs excel at compression and reconstruction, their latents inherently encode low-level acoustic details rather than semantically discriminative information, leading to entangled event semantics and complicating the training of generative models. To address these issues, we discard VAE acoustic latents and introduce semantic encoder latents, thereby proposing SemanticVocoder, a generative vocoder that directly synthesizes waveforms from semantic latents. Equipped with SemanticVocoder, our text-to-audio generation model achieves a Fréchet Distance of 12.823 and a Fréchet Audio Distance of 1.709 on the AudioCaps test set, as the introduced semantic latents exhibit superior discriminability compared to acoustic VAE latents. Beyond improved generation performance, it also serves as a promising attempt towards unifying audio understanding and generation within a shared semantic space. Generated samples are available at https://zeyuxie29.github.io/SemanticVocoder/.
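
For readers unfamiliar with the two reported metrics: Fréchet-style scores are usually computed by fitting a Gaussian to embeddings of reference audio and another to embeddings of generated audio, then measuring the Fréchet distance between the two Gaussians (lower is better). The sketch below is an illustration only, not the paper's evaluation code. It assumes diagonal covariances and population variance so that it runs on the standard library alone, whereas the standard metric uses full covariance matrices and a matrix square root. The function names and the toy embeddings are hypothetical, and the embedding extractor is left out because the abstract does not name one.

```python
import math
from statistics import fmean, pvariance


def gaussian_stats(embeddings):
    """Return per-dimension mean and population variance of equal-length vectors."""
    columns = list(zip(*embeddings))
    means = [fmean(col) for col in columns]
    variances = [pvariance(col) for col in columns]
    return means, variances


def frechet_distance_diag(ref_embeddings, gen_embeddings):
    """Frechet distance between two Gaussians, ASSUMING diagonal covariances.

    With diagonal covariances, the general trace term
    Tr(S1 + S2 - 2 * sqrt(S1 @ S2)) reduces to a sum of squared
    differences of per-dimension standard deviations.
    """
    mu_r, var_r = gaussian_stats(ref_embeddings)
    mu_g, var_g = gaussian_stats(gen_embeddings)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum(
        (math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_r, var_g)
    )
    return mean_term + cov_term


# Toy 2-dimensional "embeddings" (hypothetical values, not real audio features).
reference = [[0.0, 0.0], [2.0, 2.0]]
generated = [[1.0, 0.0], [3.0, 2.0]]
print(frechet_distance_diag(reference, generated))  # 1.0
```

In the toy example the two sets have identical per-dimension variances and differ only by a shift of 1 in the first dimension, so the distance equals the squared mean shift.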

Problem

Research questions and friction points this paper is trying to address.

audio generation

semantic latents

VAE latents

acoustic details

semantic disentanglement

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic Latents

Generative Vocoder

Audio Generation