🤖 AI Summary
Existing unified multimodal models often adopt symmetric architectures (e.g., Mixture-of-Transformers, MoT), which struggle to reconcile the inherent modality disparities between understanding experts (e.g., LLMs) and generation experts (e.g., diffusion models), resulting in weak cross-modal alignment and suboptimal generation quality. To address this, we propose HBridge, a novel H-shaped asymmetric fusion architecture. HBridge decouples the shallow and deep, modality-specific layers of the two experts and selectively bridges them only at intermediate layers. It introduces semantic reconstruction tokens to explicitly guide visual semantic restoration, and it designs a heterogeneous-expert Transformer bridge with selective attention sharing. Evaluated on multiple benchmarks, HBridge reduces attention sharing by over 40% relative to dense-fusion baselines while improving both generation efficiency and fidelity. This work establishes a new paradigm for unified multimodal generation.
📄 Abstract
Recent unified models integrate understanding experts (e.g., LLMs) with generative experts (e.g., diffusion models), achieving strong multimodal performance. However, recent advanced methods such as BAGEL and LMFusion follow the Mixture-of-Transformers (MoT) paradigm, adopting a symmetric design that mirrors one expert onto the other for convenient initialization and fusion, which remains suboptimal due to inherent modality discrepancies. In this work, we propose HBridge, an asymmetric H-shaped architecture that enables heterogeneous experts to optimally leverage pretrained priors from their respective modality domains. Unlike prior dense fusion strategies that straightforwardly connect all layers between experts via shared attention, HBridge selectively bridges intermediate layers, reducing attention sharing by over 40%, which improves efficiency and enhances generation quality. Shallow and deep layers, which capture modality-specific representations, are decoupled, while mid-layer bridging promotes semantic alignment. To further strengthen cross-modal coherence, we introduce semantic reconstruction tokens that explicitly guide the generative expert to reconstruct the visual semantic tokens of the target image. Extensive experiments across multiple benchmarks demonstrate the effectiveness and superior performance of HBridge, establishing a new paradigm for unified multimodal generation.
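The mid-layer bridging idea can be made concrete with a small sketch. The snippet below is illustrative only, not the authors' implementation: the function names, the layer-selection rule, and the example depths are assumptions. It shows how, given a stack depth and counts of decoupled shallow and deep layers, one might compute which intermediate layers share attention between the two experts and the resulting reduction in layer-wise attention sharing versus dense (all-layer) fusion.

```python
def bridged_layers(depth: int, shallow_skip: int, deep_skip: int) -> list[int]:
    """Indices of intermediate layers that share attention between experts.

    Shallow layers [0, shallow_skip) and deep layers
    [depth - deep_skip, depth) stay modality-specific (decoupled);
    only the middle band is bridged. Hypothetical rule for illustration.
    """
    return list(range(shallow_skip, depth - deep_skip))


def sharing_reduction(depth: int, shallow_skip: int, deep_skip: int) -> float:
    """Fraction of layer-wise attention sharing removed vs. dense fusion,
    where dense fusion shares attention at every one of `depth` layers."""
    bridged = len(bridged_layers(depth, shallow_skip, deep_skip))
    return 1.0 - bridged / depth


# Example: a 28-layer stack with the first 8 and last 6 layers decoupled,
# so only the 14 middle layers (8..21) are bridged.
print(bridged_layers(28, 8, 6))   # [8, 9, ..., 21]
print(sharing_reduction(28, 8, 6))  # 0.5 -> over 40% less attention sharing
```

Under this (assumed) accounting, any split that keeps fewer than ~60% of the layers bridged yields the "over 40%" reduction in attention sharing that the abstract reports; the actual layer selection in HBridge may differ.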