Learning to Upsample and Upmix Audio in the Latent Domain

📅 2025-05-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio up-sampling and up-mixing methods typically operate in the waveform or spectral domain, incurring high computational cost. This work proposes the first end-to-end latent-space framework for audio up-sampling and up-mixing, performing bandwidth extension and mono-to-stereo conversion directly on the latent representations of a pre-trained neural audio autoencoder, thereby avoiding repeated encoding and decoding. The method is trained with only a latent-space L1 reconstruction loss and a single latent-domain adversarial discriminator, dispensing with the conventional combinations of multi-scale waveform losses and discriminators. Experiments show up to a 100× speedup over waveform-domain state-of-the-art methods on both tasks at comparable perceptual quality, with substantially lower compute and memory overhead. By moving the entire processing pipeline into the latent space, the approach establishes a new paradigm for efficient neural audio processing.

📝 Abstract
Neural audio autoencoders create compact latent representations that preserve perceptually important information, serving as the foundation for both modern audio compression systems and generation approaches like next-token prediction and latent diffusion. Despite their prevalence, most audio processing operations, such as spatial and spectral up-sampling, still inefficiently operate on raw waveforms or spectral representations rather than directly on these compressed representations. We propose a framework that performs audio processing operations entirely within an autoencoder's latent space, eliminating the need to decode to raw audio formats. Our approach dramatically simplifies training by operating solely in the latent domain, with a latent L1 reconstruction term, augmented by a single latent adversarial discriminator. This contrasts sharply with raw-audio methods that typically require complex combinations of multi-scale losses and discriminators. Through experiments in bandwidth extension and mono-to-stereo up-mixing, we demonstrate computational efficiency gains of up to 100x while maintaining quality comparable to post-processing on raw audio. This work establishes a more efficient paradigm for audio processing pipelines that already incorporate autoencoders, enabling significantly faster and more resource-efficient workflows across various audio tasks.
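The training recipe the abstract describes, a latent L1 reconstruction term plus one latent-domain adversarial discriminator, can be sketched as follows. This is a minimal illustrative sketch, not the paper's code: the module architectures, latent shape `(batch, 64, 128)`, LSGAN-style adversarial term, and the `0.1` loss weight are all assumptions.

```python
# Hedged sketch of a latent-domain up-sampling objective: the generator maps
# band-limited latents to full-band latents; training uses latent L1 plus a
# single latent adversarial critic. Shapes and hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentUpsampler(nn.Module):
    """Maps low-bandwidth latents to full-bandwidth latents (1D conv stack)."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)


class LatentDiscriminator(nn.Module):
    """Single adversarial critic operating on latent sequences."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv1d(dim, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)


# Toy batch standing in for latents from a frozen pre-trained autoencoder:
# z_lowband encodes band-limited audio, z_target encodes the full-band target.
torch.manual_seed(0)
z_lowband = torch.randn(2, 64, 128)
z_target = torch.randn(2, 64, 128)

model, disc = LatentUpsampler(), LatentDiscriminator()
z_pred = model(z_lowband)

# Generator objective: latent L1 reconstruction + adversarial term.
l1_loss = F.l1_loss(z_pred, z_target)
adv_loss = ((disc(z_pred) - 1.0) ** 2).mean()  # LSGAN-style term (assumption)
gen_loss = l1_loss + 0.1 * adv_loss            # weighting is illustrative
```

The point of the sketch is what is *absent*: no decoder call, no multi-scale STFT or waveform losses, and no bank of waveform discriminators; everything is computed on short latent sequences, which is where the claimed efficiency comes from.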
Problem

Research questions and friction points this paper is trying to address.

Perform audio processing operations efficiently in an autoencoder's latent space
Eliminate the need to decode back to raw waveforms or spectral representations
Achieve large computational savings without sacrificing perceptual audio quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Performs up-sampling and up-mixing entirely within the autoencoder's latent space
Trains with only a latent L1 reconstruction loss plus a single latent adversarial discriminator
Achieves up to 100× speedup while maintaining quality comparable to raw-audio processing