🤖 AI Summary
This work addresses the trade-off in audio variational autoencoders (VAEs) between over-regularization, which degrades output quality, and under-regularization, which makes the latents harder for downstream models to predict. The authors propose a target-KL regularization mechanism that, for the first time, enables controllable training of audio VAEs at specified bitrates. Leveraging rate–distortion theory, they construct rate–distortion curves over the continuous latent space, establishing a fair basis for comparison with discrete neural audio codecs. The approach is evaluated in text-to-sound generation, where the impact of varying compression rates on synthesis quality is systematically analyzed; sweeping the rate helps identify the configuration that best balances audio fidelity against the predictability of the latent representations.
📝 Abstract
Latent diffusion models have emerged as the dominant paradigm for many generation tasks, including audio generation tasks such as text-to-audio, text-to-music, and text-to-speech. A key component of latent diffusion is a variational autoencoder (VAE) that compresses high-dimensional signals into a low-frame-rate continuous representation that is conducive to downstream prediction. Regularizing these VAEs is challenging, as there is a trade-off between over-regularized (poor output quality) and under-regularized (difficult to predict) latent representations. We propose a framework for studying this trade-off through compression and train audio VAEs at specific bitrates via target-KL regularization. This allows direct comparison to well-studied discrete neural audio codec models, and the construction of rate-distortion curves for audio VAEs. We evaluate the impact of target-KL regularization on text-to-sound generation and find that sweeping compression rates is helpful in identifying the optimal generation setting.
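To make the link between a KL target and a bitrate concrete, the following is a minimal sketch, not the authors' implementation. It assumes a diagonal Gaussian posterior with a standard normal prior, treats the per-frame KL (in nats) as an estimate of the information rate, and uses a hypothetical absolute-deviation penalty to pull the KL toward its target. All function names and the 25 Hz latent frame rate are illustrative assumptions that do not come from the abstract.

```python
import math


def gaussian_kl_nats(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) in nats for one latent frame.

    mu and logvar are sequences with one entry per latent channel.
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def kl_to_bitrate_bps(kl_nats_per_frame, frame_rate_hz):
    """Convert a per-frame KL in nats into bits per second."""
    return kl_nats_per_frame / math.log(2) * frame_rate_hz


def bitrate_to_target_kl(bitrate_bps, frame_rate_hz):
    """Inverse mapping: the per-frame KL (nats) implied by a desired bitrate."""
    return bitrate_bps * math.log(2) / frame_rate_hz


def target_kl_penalty(kl_per_frame, target_kl_nats):
    """Hypothetical regularizer: |mean KL - target|, zero when the target is met."""
    mean_kl = sum(kl_per_frame) / len(kl_per_frame)
    return abs(mean_kl - target_kl_nats)


if __name__ == "__main__":
    frame_rate_hz = 25.0  # assumed latent frame rate, for illustration only
    target = bitrate_to_target_kl(1000.0, frame_rate_hz)
    print(f"target KL per frame: {target:.3f} nats")
    print(f"implied bitrate: {kl_to_bitrate_bps(target, frame_rate_hz):.1f} bps")
```

Under these assumptions, sweeping the requested bitrate and recording the reconstruction error at each setting would yield one point per model on a rate-distortion curve, which is the kind of comparison against discrete codecs the abstract describes.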