HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation

2025-11-25
Citations: 0
Influential: 0
AI Summary
Existing unified multimodal models often adopt symmetric architectures (e.g., MoT), which struggle to reconcile the inherent modality discrepancies between understanding experts (e.g., LLMs) and generation experts (e.g., diffusion models), leading to weak cross-modal alignment and suboptimal generation quality. To address this, we propose HBridge, an asymmetric H-shaped fusion architecture. HBridge decouples the shallow and deep layers, which capture modality-specific representations, and bridges the heterogeneous understanding and generation experts only at intermediate layers through selective attention sharing. It also introduces semantic reconstruction tokens that explicitly guide the generative expert to reconstruct the visual semantic tokens of the target image. Compared with dense fusion strategies that share attention across all layers, HBridge reduces attention sharing by over 40% while improving both efficiency and generation quality on multiple benchmarks. This work establishes a new paradigm for unified multimodal generation.

πŸ“ Abstract
Recent unified models integrate understanding experts (e.g., LLMs) with generative experts (e.g., diffusion models), achieving strong multimodal performance. However, advanced methods such as BAGEL and LMFusion follow the Mixture-of-Transformers (MoT) paradigm, adopting a symmetric design that mirrors one expert onto the other for convenient initialization and fusion. This design remains suboptimal because of inherent modality discrepancies. In this work, we propose HBridge, an asymmetric H-shaped architecture that enables heterogeneous experts to optimally leverage pretrained priors from their respective modality domains. Unlike prior dense fusion strategies that straightforwardly connect all layers between experts via shared attention, HBridge selectively bridges only the intermediate layers, reducing attention sharing by over 40%, which improves efficiency and enhances generation quality. Shallow and deep layers, which capture modality-specific representations, are decoupled, while mid-layer bridging promotes semantic alignment. To further strengthen cross-modal coherence, we introduce semantic reconstruction tokens that explicitly guide the generative expert to reconstruct the visual semantic tokens of the target image. Extensive experiments across multiple benchmarks demonstrate the effectiveness and superior performance of HBridge, establishing a new paradigm for unified multimodal generation.
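The selective bridging described in the abstract can be pictured with a small sketch. It is an illustration only, not the authors' implementation: the names `BridgePlan` and `visible_sources`, the 28-layer depth, and the bridged range 8 to 24 are all hypothetical, both experts are assumed to have the same depth for simplicity, and sharing in bridged layers is assumed to be bidirectional. The sketch only shows how bridging a middle band of layers leaves the shallow and deep layers decoupled, and how the fraction of layers without shared attention follows from that choice.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class BridgePlan:
    """Hypothetical plan of which layers share attention across experts."""
    num_layers: int    # depth of each expert (assumed equal, for simplicity)
    bridge_start: int  # first bridged layer index (inclusive)
    bridge_end: int    # end of the bridged range (exclusive)

    def __post_init__(self) -> None:
        if not 0 <= self.bridge_start <= self.bridge_end <= self.num_layers:
            raise ValueError("bridged range must lie within the layer stack")

    def is_bridged(self, layer: int) -> bool:
        """True if this layer shares attention between the two experts."""
        return self.bridge_start <= layer < self.bridge_end

    def layer_modes(self) -> List[str]:
        """Per-layer label: 'shared' in the bridge, 'separate' elsewhere."""
        return ["shared" if self.is_bridged(i) else "separate"
                for i in range(self.num_layers)]

    def sharing_reduction(self) -> float:
        """Fraction of layers that no longer share attention, relative to
        a dense design in which every layer is shared."""
        bridged = self.bridge_end - self.bridge_start
        return 1.0 - bridged / self.num_layers


def visible_sources(plan: BridgePlan, layer: int, expert: str) -> Tuple[str, ...]:
    """Which experts' keys/values a token of `expert` may attend to at `layer`.

    Assumption: bridged layers attend over both experts' tokens, while
    decoupled layers attend only within the token's own expert.
    """
    if plan.is_bridged(layer):
        return ("understanding", "generation")
    return (expert,)


if __name__ == "__main__":
    # Hypothetical numbers chosen only to illustrate the arithmetic.
    plan = BridgePlan(num_layers=28, bridge_start=8, bridge_end=24)
    print(f"layers without sharing: {plan.sharing_reduction():.1%}")
    print(visible_sources(plan, 2, "generation"))
    print(visible_sources(plan, 12, "generation"))
```

With these made-up numbers, 16 of 28 layers are bridged, so 12 of 28 layers (about 43%) carry no cross-expert attention. The actual depths and bridged range used by HBridge are not stated in this summary.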
Problem

Research questions and friction points this paper is trying to address.

Bridges heterogeneous experts for unified multimodal understanding and generation
Reduces inefficient attention sharing between modality-specific layers
Enhances cross-modal coherence through selective intermediate layer bridging
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asymmetric H-shaped architecture bridges heterogeneous experts
Selective mid-layer bridging reduces attention sharing by over 40%
Semantic reconstruction tokens guide cross-modal generation coherence
Xiang Wang
Key Laboratory of Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
Zhifei Zhang
Adobe Research
Computer Vision, Deep learning
He Zhang
Adobe Research
Zhe Lin
Assistant Professor, the School of Integrated Circuits, Sun Yat-sen University, China
FPGA, EDA, reconfigurable computing, heterogeneous computing
Yuqian Zhou
Senior Research Scientist at Adobe Research
computer vision, low-quality vision, medical image processing, affective computing, human computer
Qing Liu
Adobe Research
Shiwei Zhang
Key Laboratory of Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
Yijun Li
Adobe Research
Computer Vision
Shaoteng Liu
Adobe Research
Haitian Zheng
Research Scientist, Adobe Research
Computer Vision, Generative Model, Image Manipulation and Editing
Jason Kuen
Adobe Research
Deep Learning, Computer Vision
Yuehuan Wang
Key Laboratory of Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
Changxin Gao
Key Laboratory of Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
Nong Sang
Huazhong University of Science and Technology
Computer Vision and Pattern Recognition