Evaluating Disentangled Representations for Controllable Music Generation

📅 2026-02-10

🤖 AI Summary

This work addresses the mismatch between the disentangled representations used in music generation models and their intended semantic meanings, a mismatch that limits controllable generation. It presents the first multidimensional disentanglement evaluation framework for music audio, using probing analyses to assess representations along four axes: informativeness, equivariance, invariance, and disentanglement. The evaluated models cover a range of unsupervised disentanglement strategies, including inductive biases, data augmentation, adversarial objectives, and staged training. The study systematically evaluates how well key attributes such as musical structure and timbre are captured in the learned representations. The results reveal inconsistencies between the actual behavior of the latent embeddings and their design intent, pointing to substantial challenges in achieving semantic consistency and motivating a re-examination of how controllability is approached in music generation.

📝 Abstract

Recent approaches in music generation rely on disentangled representations, often labeled as structure and timbre or local and global, to enable controllable synthesis. Yet the underlying properties of these embeddings remain underexplored. In this work, we evaluate such disentangled representations in a set of music audio models for controllable generation using a probing-based framework that goes beyond standard downstream tasks. The selected models reflect diverse unsupervised disentanglement strategies, including inductive biases, data augmentations, adversarial objectives, and staged training procedures. We further isolate specific strategies to analyze their effect. Our analysis spans four key axes: informativeness, equivariance, invariance, and disentanglement, which are assessed across datasets, tasks, and controlled transformations. Our findings reveal inconsistencies between intended and actual semantics of the embeddings, suggesting that current strategies fall short of producing truly disentangled representations, and prompting a re-examination of how controllability is approached in music generation.
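
To make the probing idea concrete, the following is a minimal sketch of cross-attribute probing, not the authors' implementation. It assumes synthetic embeddings and toy labels in place of real audio models: the names `fit_linear_probe`, `probe_accuracy`, `cross_probe_demo`, `z_structure`, and `z_timbre` are hypothetical, and the ridge-regularised linear classifier is just one simple choice of probe. It touches only the informativeness and disentanglement axes.

```python
import numpy as np


def fit_linear_probe(X, y, n_classes, l2=1e-3):
    """Fit a ridge-regularised linear probe on one-hot class targets."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    Y = np.eye(n_classes)[y]                        # one-hot targets
    gram = Xb.T @ Xb + l2 * np.eye(Xb.shape[1])
    return np.linalg.solve(gram, Xb.T @ Y)


def probe_accuracy(W, X, y):
    """Accuracy of a fitted probe on held-out embeddings."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return float(np.mean((Xb @ W).argmax(axis=1) == y))


def cross_probe_demo(seed=0, n=2000, dim=16, n_classes=4):
    """Probe two synthetic embeddings for two synthetic attributes."""
    rng = np.random.default_rng(seed)
    structure = rng.integers(0, n_classes, size=n)  # toy "structure" label
    timbre = rng.integers(0, n_classes, size=n)     # toy "timbre" label
    proto_s = rng.normal(size=(n_classes, dim))
    proto_t = rng.normal(size=(n_classes, dim))

    # Assumption for illustration: z_structure encodes only structure,
    # while z_timbre deliberately leaks structure information.
    z_structure = proto_s[structure] + 0.2 * rng.normal(size=(n, dim))
    z_timbre = (proto_t[timbre] + 0.5 * proto_s[structure]
                + 0.2 * rng.normal(size=(n, dim)))

    half = n // 2  # first half trains the probe, second half evaluates it
    results = {}
    for emb_name, Z in [("z_structure", z_structure), ("z_timbre", z_timbre)]:
        for attr_name, y in [("structure", structure), ("timbre", timbre)]:
            W = fit_linear_probe(Z[:half], y[:half], n_classes)
            results[(emb_name, attr_name)] = probe_accuracy(W, Z[half:], y[half:])
    return results


results = cross_probe_demo()
for (emb_name, attr_name), acc in results.items():
    print(f"{emb_name} -> {attr_name}: {acc:.2f}")
```

In this toy setup, a well-disentangled pair would show high probe accuracy on the matching attribute and near-chance accuracy (0.25 for four classes) on the other one. The deliberate leak in `z_timbre` is the kind of gap between intended and actual semantics that cross-probing is meant to surface. Assessing equivariance and invariance would additionally require controlled transformations of the input audio, which this sketch does not model.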

Problem

Research questions and friction points this paper is trying to address.

disentangled representations

controllable music generation

embedding semantics

music audio models

Innovation

Methods, ideas, or system contributions that make the work stand out.

disentangled representations

controllable music generation

probing framework

equivariance