🤖 AI Summary
Existing unified multimodal models often adopt symmetric architectures (e.g., Mixture-of-Transformers, MoT), which struggle to reconcile the inherent modality disparities between understanding experts (e.g., LLMs) and generation experts (e.g., diffusion models), resulting in weak cross-modal alignment and suboptimal generation quality. To address this, we propose HBridge, a novel H-shaped asymmetric fusion architecture. HBridge decouples the shallow and deep, modality-specific layers of the two experts and selectively bridges them only at intermediate layers. It introduces semantic reconstruction tokens to explicitly guide visual semantic restoration, and it designs a heterogeneous-expert Transformer bridge with selective attention sharing. Evaluated on multiple benchmarks, HBridge reduces attention sharing by over 40% relative to dense-fusion baselines while improving both generation efficiency and fidelity. This work establishes a new paradigm for unified multimodal generation.
📄 Abstract
Recent unified models integrate understanding experts (e.g., LLMs) with generative experts (e.g., diffusion models), achieving strong multimodal performance. However, recent advanced methods such as BAGEL and LMFusion follow the Mixture-of-Transformers (MoT) paradigm, adopting a symmetric design that mirrors one expert onto the other for convenient initialization and fusion, which remains suboptimal due to inherent modality discrepancies. In this work, we propose HBridge, an asymmetric H-shaped architecture that enables heterogeneous experts to optimally leverage pretrained priors from their respective modality domains. Unlike prior dense fusion strategies that straightforwardly connect all layers between experts via shared attention, HBridge selectively bridges intermediate layers, reducing attention sharing by over 40%, which improves efficiency and enhances generation quality. Shallow and deep layers, which capture modality-specific representations, are decoupled, while mid-layer bridging promotes semantic alignment. To further strengthen cross-modal coherence, we introduce semantic reconstruction tokens that explicitly guide the generative expert to reconstruct the visual semantic tokens of the target image. Extensive experiments across multiple benchmarks demonstrate the effectiveness and superior performance of HBridge, establishing a new paradigm for unified multimodal generation.
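The mid-layer bridging idea can be made concrete with a small sketch. The snippet below is illustrative only, not the authors' implementation: the function names, the layer-selection rule, and the example depths are assumptions. It shows how, given a stack depth and counts of decoupled shallow and deep layers, one might compute which intermediate layers share attention between the two experts and the resulting reduction in layer-wise attention sharing versus dense (all-layer) fusion.

```python
def bridged_layers(depth: int, shallow_skip: int, deep_skip: int) -> list[int]:
    """Indices of intermediate layers that share attention between experts.

    Shallow layers [0, shallow_skip) and deep layers
    [depth - deep_skip, depth) stay modality-specific (decoupled);
    only the middle band is bridged. Hypothetical rule for illustration.
    """
    return list(range(shallow_skip, depth - deep_skip))


def sharing_reduction(depth: int, shallow_skip: int, deep_skip: int) -> float:
    """Fraction of layer-wise attention sharing removed vs. dense fusion,
    where dense fusion shares attention at every one of `depth` layers."""
    bridged = len(bridged_layers(depth, shallow_skip, deep_skip))
    return 1.0 - bridged / depth


# Example: a 28-layer stack with the first 8 and last 6 layers decoupled,
# so only the 14 middle layers (8..21) are bridged.
print(bridged_layers(28, 8, 6))   # [8, 9, ..., 21]
print(sharing_reduction(28, 8, 6))  # 0.5 -> over 40% less attention sharing
```

Under this (assumed) accounting, any split that keeps fewer than ~60% of the layers bridged yields the "over 40%" reduction in attention sharing that the abstract reports; the actual layer selection in HBridge may differ.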