SAME: A Semantically-Aligned Music Autoencoder

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of preserving reconstruction fidelity and downstream generative performance for music and general audio under extreme temporal compression, up to 4096×. The authors propose SAME, a semantically aligned autoencoder for stereo audio that combines semantic regularization, phase-aware reconstruction losses, and improved discriminator designs with a Transformer-based backbone. Two variants are released: SAME-L, a large model, and SAME-S, a lightweight version deployable on CPU. The high compression ratio and the reliance on well-optimised transformer primitives reduce computational cost while maintaining reconstruction quality and generative capability.

📝 Abstract

Latent representations are at the heart of the majority of modern generative models. In the audio domain they are typically produced by a neural-audio-codec autoencoder. In this work we introduce SAME (Semantically-Aligned Music autoEncoder), an autoencoder for stereo music and general audio that reaches a 4096$\times$ temporal compression ratio while maintaining reconstruction quality and downstream generative performance. We achieve this by combining a transformer-based backbone with a set of semantic regularisation approaches, phase-aware reconstruction losses and improved discriminator designs. The architecture delivers substantial computational cost benefits, through both its high compression ratio and its reliance on well-optimised transformer primitives. Two variants (a large SAME-L and a CPU-deployable SAME-S) are released in open-weights form.
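
To give a sense of what a 4096$\times$ temporal compression ratio means in practice, the short sketch below computes the resulting latent frame rate and sequence length. The 44.1 kHz sample rate, the 30-second clip length, and the padding-to-a-multiple behaviour are illustrative assumptions, not values stated in the abstract; the helper names are hypothetical.

```python
import math

COMPRESSION = 4096  # temporal compression ratio reported in the abstract


def latent_rate_hz(sample_rate: int, compression: int = COMPRESSION) -> float:
    """Latent frames per second of audio at the given sample rate."""
    return sample_rate / compression


def latent_frames(num_samples: int, compression: int = COMPRESSION) -> int:
    """Latent sequence length for a clip.

    Assumes the input is padded up to a multiple of the compression
    ratio (an assumption for illustration, not taken from the paper).
    """
    return math.ceil(num_samples / compression)


sample_rate = 44_100   # assumed sample rate in Hz
clip_seconds = 30      # assumed clip length

rate = latent_rate_hz(sample_rate)
frames = latent_frames(sample_rate * clip_seconds)

print(f"{rate:.2f} latent frames per second")
print(f"{frames} latent frames for a {clip_seconds}-second clip")
```

Under these assumptions, a downstream generative model would operate on a few hundred latent frames for a 30-second clip rather than over a million waveform samples per channel, which is one source of the computational savings the abstract describes.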

Problem

Research questions and friction points this paper is trying to address.

latent representation

audio compression

music autoencoder

temporal compression ratio

stereo audio

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantically-Aligned

Music Autoencoder

Temporal Compression