🤖 AI Summary
In deep variational Bayesian symbolic music generation, jointly optimizing the Kullback–Leibler divergence (KLD) and attribute-regularization (AR) losses remains challenging: overly strong KLD constraints degrade controllability, while relaxing them compromises the standard normal prior over the latent space. To address this trade-off, we propose a joint regularization framework based on learnable attribute transformations, embedded within the variational information bottleneck paradigm. On top of the reconstruction loss, a nonlinear attribute transformation module dynamically balances the KLD and AR terms, replacing rigid linear weighting with adaptive coordination. Experiments show significant improvements in both generation quality and control accuracy across multiple continuous musical attributes (e.g., tempo, density, tonal strength), while keeping the latent distribution close to the standard normal prior. Our approach thus unifies the optimization of controllability and regularization.
📝 Abstract
Explicit latent variable models provide a flexible yet powerful framework for data synthesis, enabling controlled manipulation of generative factors. With latent variables drawn from a tractable probability density function that can be further constrained, these models support continuous and semantically rich exploration of the output space through latent-space navigation. Structured latent representations are typically obtained through the joint minimization of regularization loss functions. In variational information bottleneck models, the reconstruction loss and the Kullback–Leibler divergence (KLD) are often linearly combined with an auxiliary attribute-regularization (AR) loss. However, balancing KLD and AR is delicate: when KLD dominates over AR, generative models tend to lack controllability; when AR dominates over KLD, the stochastic encoder is encouraged to violate the standard normal prior. We explore this trade-off in the context of symbolic music generation with explicit control over continuous musical attributes. We show that existing approaches struggle to jointly minimize both regularization objectives, whereas suitable attribute transformations can help achieve both controllability and regularization of the target latent dimensions.
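To make the linear combination described above concrete, the following is a minimal NumPy sketch of the three objectives. The KLD term is the standard closed form against a standard normal prior; the AR term follows the sign-agreement formulation popularized by AR-VAE (an assumption about the exact form used here), which encourages one latent dimension to order samples the same way the target attribute does. The weights `beta` and `gamma` are hypothetical names for the linear mixing coefficients.

```python
import numpy as np

def kld_standard_normal(mu, logvar):
    # Closed-form KL divergence between N(mu, sigma^2) and N(0, I),
    # summed over latent dimensions and averaged over the batch.
    return np.mean(np.sum(0.5 * (np.exp(logvar) + mu**2 - 1.0 - logvar), axis=1))

def attribute_reg_loss(z_dim, attr):
    # Sign-agreement AR loss (AR-VAE-style, assumed form): pairwise differences
    # along one regularized latent dimension should match the sign of the
    # pairwise differences of the target attribute values.
    dz = z_dim[:, None] - z_dim[None, :]   # pairwise latent differences
    da = attr[:, None] - attr[None, :]     # pairwise attribute differences
    return np.mean((np.tanh(dz) - np.sign(da)) ** 2)

def total_loss(recon, mu, logvar, z_dim, attr, beta=1.0, gamma=1.0):
    # The linear combination discussed in the abstract:
    # reconstruction + beta * KLD + gamma * AR.
    return recon + beta * kld_standard_normal(mu, logvar) \
                 + gamma * attribute_reg_loss(z_dim, attr)
```

The trade-off the abstract describes lives in `beta` and `gamma`: raising `gamma` strengthens the attribute ordering along the regularized dimension but pulls the posterior away from N(0, I), while raising `beta` does the reverse.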