🤖 AI Summary
This work proposes a unified, generation-first neural audio autoencoder that addresses key limitations of existing approaches: slow encoding, high latent representation rates, and the need for separate models for different audio channel configurations or for discrete versus continuous latents. The proposed model supports multi-channel audio formats and both discrete and continuous latent representations within a single framework, while achieving an unprecedented time-domain downsampling factor of 3360×. It maintains high-fidelity reconstruction quality while encoding 10× faster and lowering the latent bitrate by 1.6×, compressing a 60-second mono audio clip into only 788 tokens. This dramatic compression substantially reduces the computational burden of downstream generative modeling.
📝 Abstract
Neural autoencoders underpin generative models. Practical, large-scale use of neural autoencoders for generative modeling necessitates fast encoding, low latent rates, and a single model across representations. Existing approaches are reconstruction-first: they incur high latent rates and slow encoding, and they require separate architectures for discrete vs. continuous latents and for different audio channel formats, hindering workflows from preprocessing to inference conditioning. We introduce a generation-first architecture for audio autoencoding that increases temporal downsampling from 2048× to 3360× and supports continuous and discrete representations, as well as common audio channel formats, in one model. By balancing compression, quality, and speed, it delivers 10× faster encoding and 1.6× lower rates, and it eliminates channel-format-specific variants while maintaining competitive reconstruction quality. This enables applications previously constrained by processing costs: a 60-second mono signal compresses to 788 tokens, making generative modeling more tractable.
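The token counts above follow directly from the downsampling factor. A minimal sketch of the arithmetic, assuming a 44.1 kHz sample rate (the sample rate is not stated above, so it is an assumption here):

```python
import math

# Assumption: 44.1 kHz audio; the sample rate is not given in the abstract.
SAMPLE_RATE_HZ = 44_100
DOWNSAMPLE_FACTOR = 3360  # time-domain downsampling reported above

def latent_frames(duration_s: float,
                  sr: int = SAMPLE_RATE_HZ,
                  factor: int = DOWNSAMPLE_FACTOR) -> int:
    """Number of latent frames (tokens) after temporal downsampling."""
    return math.ceil(duration_s * sr / factor)

# 60 s * 44,100 Hz = 2,646,000 samples; / 3360 = 787.5 -> 788 frames
print(latent_frames(60))       # 788 tokens for a 60-second mono clip
print(latent_frames(60) / 60)  # latent frame rate of ~13.1 frames/s
```

Under this assumed sample rate, the arithmetic reproduces the 788-token figure quoted for a 60-second mono clip, corresponding to roughly 13 latent frames per second.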