AI Summary
This work addresses the challenge of preserving high-fidelity reconstruction and downstream generative performance for music and general audio under extreme temporal compression ratios of up to 4096×. To this end, the authors propose SAME, a semantically aligned autoencoder tailored for stereo audio, which integrates semantic regularization, phase-aware reconstruction losses, and an enhanced discriminator within a Transformer-based architecture to enable efficient representation learning. The study releases two variants: SAME-L, a high-performance model, and SAME-S, a lightweight version deployable on CPU. Both achieve substantial reductions in computational overhead while maintaining reconstruction fidelity and generative capability.
Abstract
Latent representations are at the heart of the majority of modern generative models. In the audio domain, they are typically produced by a neural audio codec autoencoder. In this work we introduce SAME (Semantically-Aligned Music autoEncoder), an autoencoder for stereo music and general audio that reaches a 4096$\times$ temporal compression ratio while maintaining reconstruction quality and downstream generative performance. We achieve this by combining a transformer-based backbone with a set of semantic regularisation approaches, phase-aware reconstruction losses, and improved discriminator designs. The architecture delivers substantial computational cost benefits through both its high compression ratio and its reliance on well-optimised transformer primitives. Two variants (a large SAME-L and a CPU-deployable SAME-S) are released in open-weights form.
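To give a sense of scale for the 4096$\times$ temporal compression ratio, the following minimal sketch computes how many latent frames would cover a clip and the resulting latent frame rate. The 44.1 kHz sample rate, the 30-second clip length, the tail-padding behaviour, and the helper names `latent_frames` and `latent_rate_hz` are illustrative assumptions, not details taken from the abstract.

```python
import math

COMPRESSION = 4096  # temporal compression ratio stated in the abstract


def latent_frames(num_samples: int, compression: int = COMPRESSION) -> int:
    """Latent frames needed to cover num_samples, assuming the tail is padded."""
    return math.ceil(num_samples / compression)


def latent_rate_hz(sample_rate: int, compression: int = COMPRESSION) -> float:
    """Latent frames per second of audio at the given sample rate."""
    return sample_rate / compression


# Assumed values for illustration only; the abstract does not state them.
sample_rate = 44_100
clip_seconds = 30

frames = latent_frames(sample_rate * clip_seconds)
rate = latent_rate_hz(sample_rate)
print(f"{clip_seconds} s at {sample_rate} Hz -> {frames} latent frames ({rate:.2f} frames/s)")
```

Under these assumptions, a downstream generative model would operate on a sequence of a few hundred latent frames for a 30-second clip rather than over a million audio samples per channel, which is one way to read the computational cost benefit attributed to the high compression ratio.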