Hellinger Multimodal Variational Autoencoders

📅 2026-01-10
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two challenges in multimodal variational autoencoders under weakly supervised generative learning: inaccurate joint-posterior approximation and the difficulty of balancing generation consistency against generation quality. To this end, the authors propose HELVAE, a model that takes a probabilistic opinion pooling perspective and derives a Hellinger moment-matching approximation from Hölder pooling with α = 0.5. The result is an efficient multimodal inference framework that eliminates the need for subsampling. In contrast to conventional product-of-experts or mixture-based strategies, the Hellinger approximation learns more expressive latent representations, markedly improving the trade-off between generation quality and cross-modal consistency. Extensive experiments show that HELVAE outperforms state-of-the-art multimodal VAE methods across multiple modalities.
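For context on the "conventional product-of-experts or mixture-based strategies" the summary contrasts against: for Gaussian unimodal posteriors, both aggregations have standard closed forms. The sketch below is illustrative background only (it is not code from the paper); it uses the textbook precision-weighted product of Gaussians and the moment-matched Gaussian of a mixture.

```python
import numpy as np

def poe_gaussian(mus, sigmas):
    """Product of Gaussian experts: precisions add, means are precision-weighted."""
    prec = 1.0 / np.array(sigmas, dtype=float) ** 2
    var = 1.0 / prec.sum()
    mean = var * (prec * np.array(mus, dtype=float)).sum()
    return mean, var

def moe_gaussian_moments(mus, sigmas, weights):
    """Moment-matched Gaussian for a mixture of Gaussian experts."""
    mus = np.array(mus, dtype=float)
    var_terms = np.array(sigmas, dtype=float) ** 2
    w = np.array(weights, dtype=float)
    mean = (w * mus).sum()
    # Law of total variance: E[sigma_i^2 + mu_i^2] - mean^2.
    var = (w * (var_terms + mus ** 2)).sum() - mean ** 2
    return mean, var

# Two equally confident experts that disagree: PoE sharpens the posterior,
# while the moment-matched mixture broadens it.
print(poe_gaussian([-1.0, 1.0], [1.0, 1.0]))             # (0.0, 0.5)
print(moe_gaussian_moments([-1.0, 1.0], [1.0, 1.0], [0.5, 0.5]))  # (0.0, 2.0)
```

The contrast above is the usual motivation for alternative poolings: PoE can be overconfident under disagreeing experts, while MoE can be underconfident.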

📝 Abstract
Multimodal variational autoencoders (VAEs) are widely used for weakly supervised generative learning with multiple modalities. Predominant methods aggregate unimodal inference distributions using either a product of experts (PoE), a mixture of experts (MoE), or their combinations to approximate the joint posterior. In this work, we revisit multimodal inference through the lens of probabilistic opinion pooling, an optimization-based approach. We start from Hölder pooling with α = 0.5, which corresponds to the unique symmetric member of the α-divergence family, and derive a moment-matching approximation, termed Hellinger. We then leverage such an approximation to propose HELVAE, a multimodal VAE that avoids sub-sampling, yielding an efficient yet effective model that: (i) learns more expressive latent representations as additional modalities are observed; and (ii) empirically achieves better trade-offs between generative coherence and quality, outperforming state-of-the-art multimodal VAE models.
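The abstract's core object is a moment-matching approximation to Hölder pooling at α = 0.5. The sketch below illustrates the idea numerically in 1-D, assuming the standard Hölder pool form (pooled density proportional to (Σᵢ wᵢ pᵢ(x)^α)^{1/α}); the grid, weights, and helper names are this sketch's own, and the paper's actual closed-form Hellinger approximation may differ.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated on array x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def holder_pool_moments(mus, sigmas, weights, alpha=0.5, n_grid=20001):
    """Moment-match a Gaussian to the Hoelder pool of 1-D Gaussian experts.

    Pooled (unnormalised) density: (sum_i w_i * p_i(x)**alpha) ** (1/alpha).
    Moments are computed by simple numerical quadrature on a uniform grid.
    """
    lo = min(m - 8.0 * s for m, s in zip(mus, sigmas))
    hi = max(m + 8.0 * s for m, s in zip(mus, sigmas))
    x = np.linspace(lo, hi, n_grid)
    dx = x[1] - x[0]
    dens = sum(w * gaussian_pdf(x, m, s) ** alpha
               for w, m, s in zip(weights, mus, sigmas)) ** (1.0 / alpha)
    dens /= (dens * dx).sum()                  # normalise on the grid
    mean = (x * dens * dx).sum()               # first moment
    var = (((x - mean) ** 2) * dens * dx).sum()  # second central moment
    return mean, var

# With a single expert, pooling is a no-op: the matched moments recover it.
print(holder_pool_moments([0.0], [1.0], [1.0]))  # approximately (0.0, 1.0)
# Two disagreeing experts at alpha = 0.5: an intermediate, broadened posterior.
print(holder_pool_moments([-1.0, 1.0], [1.0, 1.0], [0.5, 0.5]))
```

Varying `alpha` interpolates between mixture-like behaviour (α → 1 recovers a linear pool) and more product-like sharpening, which is what makes the symmetric α = 0.5 point a natural middle ground.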
Problem

Research questions and friction points this paper is trying to address.

multimodal variational autoencoders
joint posterior approximation
probabilistic opinion pooling
generative coherence
latent representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hellinger divergence
multimodal VAE
probabilistic opinion pooling
moment-matching approximation
Hölder pooling