LightBagel: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation

πŸ“… 2025-10-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the high computational cost and inefficiency of training unified multimodal models from scratch, this paper proposes LightBagel, a lightweight double-fusion framework that achieves efficient unified modeling without retraining foundation models. Instead, it integrates existing specialized understanding and generation models via a novel double-fusion mechanism: (i) cross-network multimodal self-attention in the high-level semantic space to enable holistic reasoning, and (ii) fine-grained alignment in the low-level representation space for precise cross-modal grounding. LightBagel requires only ~35B tokens of fine-tuning data for integration. It achieves state-of-the-art performance across four comprehensive multimodal evaluation benchmarks: GenEval (0.91), DPG-Bench (82.16), GEditBench (6.06), and ImgEdit-Bench (3.77). All code, model weights, and datasets are publicly released.

πŸ“ Abstract
Unified multimodal models have recently shown remarkable gains in both capability and versatility, yet most leading systems are still trained from scratch and require substantial computational resources. In this paper, we show that competitive performance can be obtained far more efficiently by strategically fusing publicly available models specialized for either generation or understanding. Our key design is to retain the original blocks while additionally interleaving multimodal self-attention blocks throughout the networks. This double fusion mechanism (1) effectively enables rich multimodal fusion while largely preserving the original strengths of the base models, and (2) catalyzes synergistic fusion of high-level semantic representations from the understanding encoder with low-level spatial signals from the generation encoder. By training with only ~35B tokens, this approach achieves strong results across multiple benchmarks: 0.91 on GenEval for compositional text-to-image generation, 82.16 on DPG-Bench for complex text-to-image generation, 6.06 on GEditBench, and 3.77 on ImgEdit-Bench for image editing. By fully releasing the entire suite of code, model weights, and datasets, we hope to support future research on unified multimodal modeling.
Problem

Research questions and friction points this paper is trying to address.

Efficiently fusing specialized models for multimodal tasks
Enabling unified understanding and generation with minimal training
Achieving strong performance across diverse multimodal benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuses existing models for multimodal understanding and generation
Interleaves multimodal self-attention blocks into base networks
Trains with only ~35B tokens of fine-tuning data yet achieves strong benchmark results
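The interleaving idea above can be sketched in a few lines: frozen blocks from the understanding and generation base models process their own token streams, and a shared multimodal self-attention block is inserted between layers to attend over the concatenated sequence. The sketch below is purely illustrative, not the paper's implementation: all names (`base_block`, `unified_forward`) are hypothetical, the base blocks are stand-in identity maps, and attention uses identity Q/K/V projections for brevity.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of floats
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def multimodal_self_attention(tokens):
    # Toy single-head self-attention over the concatenated multimodal
    # sequence; Q = K = V = tokens (identity projections, for illustration).
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in tokens]
        w = softmax(scores)
        out.append([sum(wj * v[d] for wj, v in zip(w, tokens))
                    for d in range(dim)])
    return out

def base_block(tokens):
    # Stand-in for a frozen transformer block taken from the specialized
    # understanding or generation model (identity map here).
    return [list(t) for t in tokens]

def unified_forward(und_tokens, gen_tokens, n_layers=2):
    # Double fusion (hypothetical sketch): each stream runs through its own
    # frozen base block, then a shared multimodal self-attention block
    # attends over the concatenated sequence, interleaved at every layer.
    for _ in range(n_layers):
        und_tokens = base_block(und_tokens)
        gen_tokens = base_block(gen_tokens)
        fused = multimodal_self_attention(und_tokens + gen_tokens)
        und_tokens = fused[:len(und_tokens)]
        gen_tokens = fused[len(und_tokens):]
    return und_tokens, gen_tokens
```

Because the base blocks are kept intact and only the interleaved attention blocks mix the two streams, each stream retains its original capability while gaining cross-modal context, which is the property the abstract attributes to the double fusion design.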
πŸ”Ž Similar Papers
2024-08-22 · International Conference on Learning Representations · Citations: 292