🤖 AI Summary
Latent spaces of audio autoencoders are typically nonlinear, which hinders intuitive algebraic operations such as mixing and scaling. Method: We propose an implicit regularization training strategy that requires no architectural or loss-function modifications; instead, it relies on data augmentation to induce a latent space in Consistency Autoencoders (CAEs) that is both homogeneous (equivariant to scalar gain) and additive. Contribution/Results: To our knowledge, this is the first method to achieve full encoder-decoder linearization while preserving high-fidelity reconstruction (STFT-MSE ≤ 0.01, PESQ ≥ 3.2). Experiments show that the resulting linear latent space supports simple arithmetic operations for music source composition and separation, yielding a 1.8 dB SDR improvement on MUSDB18 for source separation. The approach improves representation interpretability and controllability without compromising reconstruction quality.
📝 Abstract
Audio autoencoders learn useful, compressed audio representations, but their nonlinear latent spaces prevent intuitive algebraic manipulation such as mixing or scaling. We introduce a simple training methodology that uses data augmentation to induce linearity in a high-compression Consistency Autoencoder (CAE), specifically homogeneity (equivariance to scalar gain) and additivity (the decoder preserves addition), without altering the model's architecture or loss function. When trained with our method, the CAE exhibits linear behavior in both the encoder and decoder while preserving reconstruction fidelity. We test the practical utility of the learned space on music source composition and separation via simple latent arithmetic. This work presents a straightforward technique for constructing structured latent spaces, enabling more intuitive and efficient audio processing.