🤖 AI Summary
This work addresses mode collapse and limited representational capacity in Soft-IntroVAE (S-IntroVAE) caused by fixed priors. We propose a learnable multimodal prior framework, treating the prior as a "third player" jointly optimized with the encoder and decoder under an adversarial training scheme that preserves the Nash equilibrium of the original model. Building upon a modified ELBO, we derive two theoretically motivated regularizations, adaptive variance clipping and responsibility regularization, which balance prior diversity against faithful latent mode assignment. Experiments on 2D density estimation and benchmark datasets (MNIST, Fashion-MNIST, CIFAR-10) demonstrate significant improvements in sample quality, log-likelihood scores, and semantic consistency of the latent space. To our knowledge, this is the first approach to jointly optimize prior structure, learning mechanism, and representation performance in this setting. Our results empirically validate that learnable multimodal priors yield dual benefits for both generative modeling and representation learning.
📝 Abstract
Variational Autoencoders (VAEs) are a popular framework for unsupervised learning and data generation. A plethora of methods have been proposed to improve VAEs, with the incorporation of adversarial objectives and the integration of prior learning mechanisms being prominent directions. Regarding the former, an indicative instance is the recently introduced family of Introspective VAEs, which aims to ensure that a low likelihood is assigned to unrealistic samples. In this study, we focus on the Soft-IntroVAE (S-IntroVAE) and investigate the implications of incorporating a multimodal and learnable prior into this framework. Namely, we formulate the prior as a third player and show that, when trained in cooperation with the decoder, it constitutes an effective way to learn the prior while sharing the Nash equilibrium of the vanilla S-IntroVAE. Furthermore, based on a modified formulation of the optimal ELBO in S-IntroVAE, we develop theoretically motivated regularizations, namely (i) adaptive variance clipping to stabilize training when learning the prior and (ii) responsibility regularization to discourage the formation of inactive prior modes. Finally, we perform a series of targeted experiments on a 2D density estimation benchmark and in an image generation setting comprising the (F)-MNIST and CIFAR-10 datasets, demonstrating the benefit of prior learning in S-IntroVAE for both generation and representation learning.
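To make the two regularizations concrete, the following is a minimal numpy sketch, not the authors' implementation: a learnable Gaussian-mixture prior whose per-mode variances are clipped to a fixed interval (standing in for adaptive variance clipping), plus a responsibility-based penalty that grows when some modes receive near-zero posterior mass. All class and function names, the clipping bounds, and the entropy-style penalty form are illustrative assumptions.

```python
import numpy as np

class LearnableMixturePrior:
    """Hypothetical K-mode Gaussian mixture prior over a d-dim latent space."""

    def __init__(self, n_modes=8, latent_dim=2, var_min=0.05, var_max=5.0, seed=0):
        rng = np.random.default_rng(seed)
        self.mu = rng.normal(size=(n_modes, latent_dim))  # mode means (learnable)
        self.log_var = np.zeros((n_modes, latent_dim))    # mode log-variances (learnable)
        self.logits = np.zeros(n_modes)                   # mixture-weight logits (learnable)
        self.var_min, self.var_max = var_min, var_max     # assumed clipping interval

    def clipped_var(self):
        # (i) variance clipping: keep each mode's variance inside a fixed
        # interval so no mode collapses to a point or diffuses away
        return np.clip(np.exp(self.log_var), self.var_min, self.var_max)

    def _log_joint(self, z):
        # log pi_k + log N(z_n; mu_k, var_k), shape (N, K)
        var = self.clipped_var()                                    # (K, d)
        log_pi = self.logits - np.logaddexp.reduce(self.logits)     # log softmax
        diff = z[:, None, :] - self.mu[None, :, :]                  # (N, K, d)
        log_comp = -0.5 * (np.log(2 * np.pi * var)[None]
                           + diff ** 2 / var[None]).sum(-1)
        return log_pi[None] + log_comp

    def log_prob(self, z):
        # log p(z) = logsumexp_k [ log pi_k + log N(z; mu_k, var_k) ]
        return np.logaddexp.reduce(self._log_joint(z), axis=1)      # (N,)

    def responsibilities(self, z):
        # posterior over modes, r_{nk} = p(k | z_n)
        log_joint = self._log_joint(z)
        return np.exp(log_joint
                      - np.logaddexp.reduce(log_joint, axis=1, keepdims=True))

def responsibility_penalty(r, eps=1e-8):
    # (ii) responsibility regularization (one plausible form): the negative
    # entropy of the batch-averaged responsibilities; minimizing it pushes
    # mode usage toward uniform, discouraging inactive prior modes
    avg = r.mean(axis=0)
    return float((avg * np.log(avg + eps)).sum())
```

Adding `responsibility_penalty(prior.responsibilities(z))` to the training loss would penalize configurations where most latent codes fall under a single mode, one simple way to realize the "no inactive modes" goal described above.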