🤖 AI Summary
This work addresses the slow convergence and mode collapse commonly encountered in traditional mixture density networks under maximum likelihood training. By reframing these networks as deep latent variable models, the study integrates the Expectation-Maximization (EM) framework with information geometry theory to propose, for the first time, a natural gradient EM (nGEM) objective function. This formulation reveals an intrinsic connection between mixture density networks and natural gradient descent. The resulting method achieves substantial improvements in training efficiency—accelerating convergence by up to tenfold—while maintaining robust performance on high-dimensional data, all with negligible additional computational overhead. Moreover, it effectively overcomes the failure modes associated with conventional negative log-likelihood optimization.
📝 Abstract
Mixture density networks are neural networks that output Gaussian mixture parameters to represent continuous multimodal conditional densities. The standard training procedure is maximum likelihood estimation with the negative log-likelihood (NLL) objective, which suffers from slow convergence and mode collapse. In this work, we improve the optimization of mixture density networks by exploiting their information geometry. Specifically, we interpret mixture density networks as deep latent-variable models and analyze them through an expectation-maximization framework, which reveals surprising theoretical connections to natural gradient descent. We then exploit these connections to derive the natural gradient expectation maximization (nGEM) objective. We show empirically that nGEM achieves up to 10$\times$ faster convergence while adding almost zero computational overhead, and scales well to high-dimensional data where NLL otherwise fails.
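To make the setup concrete, here is a minimal sketch of the baseline the abstract describes: a tiny mixture density network whose head emits the weights, means, and scales of a 1-D Gaussian mixture, trained (in principle) by minimizing the NLL objective. The architecture, sizes, and weights are all illustrative assumptions, not the paper's actual model or the nGEM method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minimal MDN: one hidden layer maps a scalar input x to the
# parameters of a K-component 1-D Gaussian mixture over the target y.
K, H = 3, 16
W1 = rng.normal(0.0, 0.5, (H, 1)); b1 = np.zeros(H)
W2 = rng.normal(0.0, 0.5, (3 * K, H)); b2 = np.zeros(3 * K)

def mdn_params(x):
    """Forward pass: return mixture weights pi, means mu, scales sigma."""
    h = np.tanh(W1 @ x + b1)
    out = W2 @ h + b2
    logits, mu, log_sigma = out[:K], out[K:2 * K], out[2 * K:]
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()                       # softmax -> valid mixture weights
    return pi, mu, np.exp(log_sigma)     # exp keeps sigma positive

def nll(x, y):
    """Negative log-likelihood of target y under the predicted mixture --
    the standard MDN training objective the abstract refers to."""
    pi, mu, sigma = mdn_params(x)
    comp = pi * np.exp(-0.5 * ((y - mu) / sigma) ** 2) \
              / (sigma * np.sqrt(2.0 * np.pi))
    return -np.log(comp.sum())

print(nll(np.array([0.3]), 0.5))
```

Minimizing this `nll` over a dataset with plain gradient descent is the conventional recipe whose slow convergence and mode collapse the paper targets; nGEM replaces this optimization path, not the mixture parameterization itself.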