🤖 AI Summary
This work addresses key limitations in modality translation (MT), including reliance on aligned dimensions, Gaussian source priors, and modality-specific architectures, by proposing a universal, theoretically grounded solution. We introduce the Latent Denoising Diffusion Bridge Model (LDDBM), a framework enabling bidirectional translation between arbitrary modalities without requiring dimension-wise alignment or restrictive prior assumptions. LDDBM employs a domain-agnostic encoder-decoder architecture tailored for noise prediction in a shared latent space, jointly optimizing a contrastive alignment loss and a predictive loss, with additional training strategies to improve stability. Experiments demonstrate that LDDBM significantly outperforms state-of-the-art methods on diverse tasks, including multi-view-to-3D shape generation, image super-resolution, and multi-view scene synthesis, establishing a new strong baseline for general-purpose modality translation.
📝 Abstract
Recent advances in generative modeling have positioned diffusion models as state-of-the-art tools for sampling from complex data distributions. While these models have shown remarkable success across single-modality domains such as images and audio, extending their capabilities to Modality Translation (MT), translating information across different sensory modalities, remains an open challenge. Existing approaches often rely on restrictive assumptions, including shared dimensionality, Gaussian source priors, and modality-specific architectures, which limit their generality and theoretical grounding. In this work, we propose the Latent Denoising Diffusion Bridge Model (LDDBM), a general-purpose framework for modality translation based on a latent-variable extension of Denoising Diffusion Bridge Models. By operating in a shared latent space, our method learns a bridge between arbitrary modalities without requiring aligned dimensions. We introduce a contrastive alignment loss to enforce semantic consistency between paired samples and design a domain-agnostic encoder-decoder architecture tailored for noise prediction in latent space. Additionally, we propose a predictive loss to guide training toward accurate cross-domain translation and explore several training strategies to improve stability. Our approach supports arbitrary modality pairs and performs strongly on diverse MT tasks, including multi-view to 3D shape generation, image super-resolution, and multi-view scene synthesis. Comprehensive experiments and ablations validate the effectiveness of our framework, establishing a new strong baseline in general modality translation. For more information, see our project page: https://sites.google.com/view/lddbm/home.
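The contrastive alignment loss mentioned in the abstract pulls paired samples from the two modalities toward the same region of the shared latent space. As a rough illustration of that idea (not the authors' implementation — the exact loss, temperature, and projection details are assumptions here), a symmetric InfoNCE-style loss over paired latent embeddings could look like:

```python
import numpy as np

def contrastive_alignment_loss(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE-style alignment loss over paired latents.

    z_a, z_b: (batch, dim) arrays of latent embeddings for paired samples
    from two modalities, already projected into the shared latent space.
    Matched pairs sit on the diagonal of the similarity matrix.
    """
    # Cosine similarity: normalize each embedding to unit length.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = (z_a @ z_b.T) / temperature  # (batch, batch) similarities
    n = logits.shape[0]

    def cross_entropy_diag(l):
        # Numerically stable log-softmax; targets are the diagonal entries.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Symmetrize over both translation directions (a->b and b->a).
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

Perfectly aligned pairs (identical latents) drive the loss toward zero, while mismatched batches incur a loss near log(batch size), which is what pushes the encoders of both modalities toward a semantically consistent shared space.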