🤖 AI Summary
This work addresses the problem of manual hyperparameter tuning required to balance reconstruction loss and disentanglement loss in disentangled representation learning. Methodologically, it proposes an end-to-end learnable dynamic weighting mechanism within the β-VAE framework: introducing differentiable, trainable loss weights and incorporating a gradient-aware regularization term to mitigate optimization bias in weight learning, thereby enabling joint optimization of model parameters and loss weights. The key contribution lies in embedding hyperparameters—specifically, loss weights—into the differentiable training pipeline, thus simultaneously optimizing for both disentanglement quality and reconstruction fidelity. Experiments demonstrate state-of-the-art or competitive disentanglement performance on standard benchmarks (e.g., dSprites, MPI3D), as measured by metrics such as DCI and MIG. Moreover, the method achieves effective unsupervised disentanglement of facial attributes—including pose and expression—on CelebA, validating its generalizability to real-world image data.
📝 Abstract
In this paper, we propose a novel model called Learnable VAE (L-VAE), which learns a disentangled representation together with the hyperparameters of the cost function. L-VAE can be considered an extension of β-VAE, in which the hyperparameter β must be adjusted empirically. L-VAE mitigates this limitation of β-VAE by learning the relative weights of the terms in the loss function to control the dynamic trade-off between disentanglement and reconstruction losses. In the proposed model, the weights of the loss terms and the parameters of the model architecture are learned concurrently. An additional regularization term is added to the loss function to prevent bias towards either the reconstruction or the disentanglement loss. Experimental analyses show that the proposed L-VAE finds an effective balance between reconstruction fidelity and disentangling the latent dimensions. Comparisons of the proposed L-VAE against β-VAE, VAE, ControlVAE, DynamicVAE, and σ-VAE on datasets such as dSprites, MPI3D-complex, Falcor3D, and Isaac3D reveal that L-VAE consistently provides the best or second-best performance as measured by a set of disentanglement metrics. Moreover, qualitative experiments on the CelebA dataset confirm the success of the L-VAE model in disentangling facial attributes.
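The learnable-weight idea can be illustrated on a toy problem. The sketch below is **not** the paper's exact L-VAE formulation: the two loss terms, the `log`-parameterization of the weight, and the `(log β)²` regularizer are all simplified stand-ins chosen to show the mechanism, namely that a model parameter and a loss weight descend the same objective jointly, with the regularizer keeping the weight from collapsing toward pure reconstruction or pure disentanglement.

```python
import math

# Surrogate loss terms standing in for reconstruction and KL/disentanglement
# losses (in the real model these depend on encoder/decoder outputs).
def rec(theta):
    return (theta - 2.0) ** 2

def kl(theta):
    return (theta - 0.5) ** 2

def train(steps=2000, lr=0.01, lam=0.1):
    """Jointly minimize  rec(theta) + beta * kl(theta) + lam * (log beta)^2
    over theta and beta, with beta = exp(raw_beta) kept positive."""
    theta, raw_beta = 0.0, 0.0  # raw_beta = 0  =>  beta starts at 1
    for _ in range(steps):
        beta = math.exp(raw_beta)
        # Hand-derived gradients of the toy objective:
        d_theta = 2 * (theta - 2.0) + beta * 2 * (theta - 0.5)
        # d/d(raw_beta) of beta*kl is beta*kl itself (chain rule through exp);
        # the regularizer term lam * raw_beta^2 stops beta from drifting to 0.
        d_raw = beta * kl(theta) + 2 * lam * raw_beta
        theta -= lr * d_theta
        raw_beta -= lr * d_raw
    return theta, math.exp(raw_beta)

theta, beta = train()
```

Without the regularizer, `d_raw` is non-negative whenever `kl(theta) >= 0`, so the learned weight would slide monotonically toward zero and the model would optimize reconstruction alone; the extra penalty is what makes the trade-off settle at an interior balance, which mirrors the role of the regularization term described in the abstract.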