Learning Linearity in Audio Consistency Autoencoders via Implicit Regularization

📅 2025-10-27

📈 Citations: 0

✨ Influential: 0

career value

228K/year

🤖 AI Summary

Latent spaces of audio autoencoders are typically nonlinear, hindering intuitive algebraic operations (e.g., mixing, scaling). Method: We propose an implicit regularization training strategy that requires no architectural or loss-function modifications; instead, it leverages carefully designed data augmentations to induce a latent space in Consistency Autoencoders (CAEs) that is simultaneously homogeneous, additive, and strictly gain-equivariant. Contribution/Results: To our knowledge, this is the first method achieving full encoder-decoder linearization while preserving high-fidelity reconstruction (STFT-MSE ≤ 0.01, PESQ ≥ 3.2). Experiments demonstrate that the resulting linear latent space directly enables simple arithmetic operations for music source synthesis and separation—yielding a 1.8 dB SDR improvement on MUSDB18 for source separation. The approach significantly enhances representation interpretability and controllability without compromising reconstruction quality.

Technology Category

Application Category

📝 Abstract

Audio autoencoders learn useful, compressed audio representations, but their non-linear latent spaces prevent intuitive algebraic manipulation such as mixing or scaling. We introduce a simple training methodology to induce linearity in a high-compression Consistency Autoencoder (CAE) by using data augmentation, thereby inducing homogeneity (equivariance to scalar gain) and additivity (the decoder preserves addition) without altering the model's architecture or loss function. When trained with our method, the CAE exhibits linear behavior in both the encoder and decoder while preserving reconstruction fidelity. We test the practical utility of our learned space on music source composition and separation via simple latent arithmetic. This work presents a straightforward technique for constructing structured latent spaces, enabling more intuitive and efficient audio processing.

Problem

Research questions and friction points this paper is trying to address.

Inducing linear latent spaces in audio autoencoders for intuitive manipulation

Enabling algebraic operations like mixing and scaling in compressed representations

Achieving linear behavior without altering model architecture or loss functions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Induces linearity via implicit regularization technique

Uses data augmentation without altering model architecture

Enables latent arithmetic for audio composition and separation

🔎 Similar Papers

Audio xLSTMs: Learning Self-supervised audio representations with xLSTMs