🤖 AI Summary
Current multimodal large models typically handle understanding, generation, and retrieval as separate tasks, incurring high computational overhead and weak cross-modal generalization. To address this, we propose OmniBridge, a lightweight multimodal framework that unifies all three tasks: it reuses a pretrained large language model (LLM) as the central backbone, introduces a learnable bidirectional latent-space alignment module, and employs a two-stage decoupled training strategy combined with semantic-guided diffusion training to mitigate task interference. Crucially, the framework aligns visual and language representations end-to-end within a shared latent space, without training from scratch. Extensive experiments demonstrate state-of-the-art or competitive performance across multiple benchmarks, validating the efficacy of latent-space alignment for unified multimodal modeling. The code and models are publicly available.
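The summary above mentions a learnable bidirectional latent-space alignment module without spelling out its internals. As an illustration only, here is a minimal sketch of what such a module could look like, assuming a Q-Former-style design in which learnable query embeddings cross-attend to LLM hidden states; all class names, dimensions, and hyperparameters below are hypothetical and not taken from the OmniBridge release:

```python
import torch
import torch.nn as nn

class BidirectionalLatentAligner(nn.Module):
    """Hypothetical sketch: maps between LLM hidden states and a visual
    latent space in both directions via learnable query embeddings."""

    def __init__(self, llm_dim=4096, latent_dim=1024, num_queries=64, num_heads=8):
        super().__init__()
        # Learnable queries that summarize LLM hidden states into latent tokens.
        self.queries = nn.Parameter(torch.randn(num_queries, latent_dim) * 0.02)
        # Cross-attention: queries attend over LLM hidden states (kdim/vdim = llm_dim).
        self.to_latent = nn.MultiheadAttention(
            latent_dim, num_heads, kdim=llm_dim, vdim=llm_dim, batch_first=True)
        # Reverse direction: project visual latents back into the LLM embedding space.
        self.to_llm = nn.Linear(latent_dim, llm_dim)

    def text_to_latent(self, llm_hidden):
        # llm_hidden: (batch, seq_len, llm_dim)
        batch = llm_hidden.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        latent, _ = self.to_latent(q, llm_hidden, llm_hidden)
        return latent  # (batch, num_queries, latent_dim)

    def latent_to_text(self, visual_latent):
        # visual_latent: (batch, num_tokens, latent_dim) from a vision encoder
        return self.to_llm(visual_latent)  # soft tokens injected into the LLM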
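In this reading, `text_to_latent` would produce the conditioning tokens for image generation, while `latent_to_text` would feed visual features into the LLM for understanding and retrieval; the actual module design may differ.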
📝 Abstract
Recent advances in multimodal large language models (LLMs) have led to significant progress in understanding, generation, and retrieval tasks. However, current solutions often treat these tasks in isolation or require training LLMs from scratch, resulting in high computational costs and limited generalization across modalities. In this work, we present OmniBridge, a unified and modular multimodal framework that supports vision-language understanding, generation, and retrieval within a single architecture. OmniBridge adopts a language-centric design that reuses pretrained LLMs and introduces a lightweight bidirectional latent alignment module. To address the challenge of task interference, we propose a two-stage decoupled training strategy: first, supervised fine-tuning and latent space alignment adapt LLM behavior to multimodal reasoning; second, semantic-guided diffusion training aligns cross-modal latent spaces via learnable query embeddings. Extensive experiments across a wide range of benchmarks demonstrate that OmniBridge achieves competitive or state-of-the-art performance on all three tasks. Moreover, our results highlight the effectiveness of latent space alignment for unifying multimodal modeling under a shared representation space. Code and models are released at https://github.com/xiao-xt/OmniBridge.
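The abstract describes the two-stage decoupled training strategy only at a high level. The sketch below shows one plausible way the decoupling could be wired up, assuming stage 1 updates the LLM and the alignment module under a supervised objective, while stage 2 freezes the LLM and trains the diffusion decoder on aligner-produced latents; the helpers `llm_sft_loss`, `llm_hidden_states`, and `diffusion_denoising_loss` are hypothetical placeholders, not OmniBridge APIs:

```python
import torch

def two_stage_training(llm, aligner, diffusion_decoder,
                       sft_batches, gen_batches, lr=1e-4):
    # Stage 1: supervised fine-tuning + latent space alignment.
    # Only the LLM and the alignment module are updated, so the
    # generation objective cannot interfere with multimodal reasoning.
    opt = torch.optim.AdamW(
        list(llm.parameters()) + list(aligner.parameters()), lr=lr)
    for batch in sft_batches:
        loss = llm_sft_loss(llm, aligner, batch)  # hypothetical combined loss
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Stage 2: semantic-guided diffusion training.
    # The LLM is frozen; its hidden states pass through the learnable
    # queries to condition the diffusion decoder.
    for p in llm.parameters():
        p.requires_grad_(False)
    opt = torch.optim.AdamW(
        list(aligner.parameters()) + list(diffusion_decoder.parameters()), lr=lr)
    for batch in gen_batches:
        with torch.no_grad():
            hidden = llm_hidden_states(llm, batch["captions"])  # hypothetical
        cond = aligner.text_to_latent(hidden)
        loss = diffusion_denoising_loss(  # hypothetical denoising objective
            diffusion_decoder, batch["images"], cond)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

Freezing the LLM in stage 2 is one way to realize the "decoupling": gradients from the diffusion objective never reach the language backbone, which would prevent generation training from degrading the reasoning behavior learned in stage 1.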