Multimodal ELBO with Diffusion Decoders

📅 2024-08-29
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address the low image fidelity and cross-modal semantic inconsistency inherent in multimodal variational autoencoders (VAEs), this paper proposes DiffMM-VAE. Methodologically, it introduces a diffusion model as the unified decoder for multimodal VAEs—replacing conventional feedforward decoders for the first time—and incorporates an auxiliary score-matching module to strengthen unconditional generation capability. Furthermore, it establishes an end-to-end co-training framework jointly optimizing diffusion and feedforward decoders. By integrating variational inference, multimodal latent variable modeling, and joint distribution optimization, DiffMM-VAE preserves full flexibility for both conditional and unconditional generation across arbitrary modalities while substantially improving image fidelity and cross-modal consistency. Extensive experiments on multiple benchmark datasets demonstrate state-of-the-art performance in the multimodal VAE literature.
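The core idea summarized above — a VAE whose reconstruction term is a diffusion denoising loss conditioned on the multimodal latent — can be sketched in a few lines. Everything below is a toy illustration with hypothetical names and linear placeholder networks, not the paper's actual architecture or objective:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only; the paper's models are far larger).
x_dim, z_dim, T = 8, 4, 100

# Standard linear beta schedule for the forward diffusion process.
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

# Placeholder encoder: linear maps producing mean and log-variance of q(z|x).
W_mu = rng.normal(scale=0.1, size=(z_dim, x_dim))
W_lv = rng.normal(scale=0.1, size=(z_dim, x_dim))

# Placeholder z-conditional noise predictor eps_theta(x_t, t, z), linear in [x_t, z].
W_eps = rng.normal(scale=0.1, size=(x_dim, x_dim + z_dim))

def diffusion_vae_loss(x):
    # Encode: q(z|x) = N(mu, diag(exp(logvar))), sampled via reparameterization.
    mu, logvar = W_mu @ x, W_lv @ x
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=z_dim)

    # KL(q(z|x) || N(0, I)) in closed form -- the usual ELBO regularizer.
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

    # Diffusion "reconstruction" term: noise x to a random step t, then score
    # the latent-conditioned denoiser with the standard epsilon-prediction MSE.
    t = rng.integers(T)
    eps = rng.normal(size=x_dim)
    x_t = np.sqrt(alpha_bars[t]) * x + np.sqrt(1.0 - alpha_bars[t]) * eps
    eps_hat = W_eps @ np.concatenate([x_t, z])
    recon = np.mean((eps - eps_hat) ** 2)

    return recon + kl

loss = diffusion_vae_loss(rng.normal(size=x_dim))
```

The point of the sketch is the structure of the objective: the familiar KL term from the VAE ELBO is kept, while the per-modality reconstruction likelihood is replaced by a denoising score-matching loss whose network is conditioned on the shared latent `z`.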

📝 Abstract
Multimodal variational autoencoders have demonstrated their ability to learn the relationships between different modalities by mapping them into a latent representation. Their design and capacity to perform any-to-any conditional and unconditional generation make them appealing. However, different variants of multimodal VAEs often suffer from generating low-quality output, particularly when complex modalities such as images are involved. In addition, they frequently exhibit low coherence among the generated modalities when sampling from the joint distribution. To address these limitations, we propose a new variant of the multimodal VAE ELBO that incorporates a better decoder using a diffusion generative model. The diffusion decoder enables the model to learn complex modalities and generate high-quality outputs. The multimodal model can also seamlessly integrate with a standard feed-forward decoder for different types of modalities, facilitating end-to-end training and inference. Furthermore, we introduce an auxiliary score-based model to enhance the unconditional generation capabilities of our proposed approach. This approach addresses the limitations imposed by conventional multimodal VAEs and opens up new possibilities to improve multimodal generation tasks. Our model provides state-of-the-art results compared to other multimodal VAEs on different datasets, with higher coherence and superior quality in the generated modalities.
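The abstract's mention of mapping several modalities into one latent representation is typically realized in multimodal VAEs by fusing per-modality posteriors, for example with a Gaussian product of experts (as in MVAE-style models). The sketch below is a generic illustration of that fusion step — the abstract does not specify the paper's exact aggregation scheme, and all names here are illustrative:

```python
import numpy as np

def product_of_experts(mus, logvars):
    # Precision-weighted combination of Gaussian experts, including a
    # standard-normal prior expert (mean 0, variance 1).
    precisions = [np.ones_like(mus[0])] + [np.exp(-lv) for lv in logvars]
    weighted = [np.zeros_like(mus[0])] + [m * p for m, p in zip(mus, precisions[1:])]
    joint_prec = np.sum(precisions, axis=0)
    joint_mu = np.sum(weighted, axis=0) / joint_prec
    return joint_mu, -np.log(joint_prec)  # joint mean, joint log-variance

# Two unit-variance modality posteriors that disagree on the latent:
# the joint posterior settles between them, shrunk toward the prior.
mu_img, lv_img = np.array([1.0, 0.0]), np.array([0.0, 0.0])
mu_txt, lv_txt = np.array([0.0, 1.0]), np.array([0.0, 0.0])
mu, lv = product_of_experts([mu_img, mu_txt], [lv_img, lv_txt])
```

Because the product weights each expert by its precision, a confident modality dominates the joint latent, while missing modalities can simply be dropped from the list — which is what makes any-to-any conditional generation convenient in this family of models.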
Problem

Research questions and friction points this paper is trying to address.

Multimodal Variational Autoencoder
Quality Degradation
Incoherent Generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Improved Multimodal Variational Autoencoder
Diffusion Decoding Mechanism
Auxiliary Component for Enhanced Consistency
Daniel Wesego
Department of Computer Science, University of Illinois Chicago
Pedram Rooshenas
University of Illinois Chicago
Deep Generative Models · Machine Learning