JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation

📅 2024-11-12

🏛️ arXiv.org

📈 Citations: 10

✨ Influential: 2

🤖 AI Summary

How can a single large model unify both visual understanding and generation? This paper proposes a lightweight unified architecture that trains rectified flow end-to-end within a large language model (LLM) framework, without complex architectural modifications to the backbone. It decouples the visual encoders used for understanding and generation and aligns their representations during unified training, which mitigates semantic conflicts between the two tasks. A multi-stage training strategy jointly optimizes the autoregressive language modeling and flow matching objectives. Evaluated on MMBench, MME, and COCO Caption benchmarks, the method achieves comparable or superior performance to task-specific models in their respective domains and significantly outperforms existing unified architectures.

📝 Abstract

We present JanusFlow, a powerful framework that unifies image understanding and generation in a single model. JanusFlow introduces a minimalist architecture that integrates autoregressive language models with rectified flow, a state-of-the-art method in generative modeling. Our key finding demonstrates that rectified flow can be straightforwardly trained within the large language model framework, eliminating the need for complex architectural modifications. To further improve the performance of our unified model, we adopt two key strategies: (i) decoupling the understanding and generation encoders, and (ii) aligning their representations during unified training. Extensive experiments show that JanusFlow achieves comparable or superior performance to specialized models in their respective domains, while significantly outperforming existing unified approaches across standard benchmarks. This work represents a step toward more efficient and versatile vision-language models.
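The rectified flow objective mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes the standard rectified flow formulation (a straight-line interpolation between a noise sample and a data sample, with the model regressing the constant velocity along that line), and the function names `rectified_flow_pair`, `flow_matching_loss`, and `euler_sample` are hypothetical. In the paper's setting the velocity would be predicted by the LLM; here it is simply a callable.

```python
import numpy as np

def rectified_flow_pair(x1, rng):
    """Build one rectified-flow training example from a data batch x1.

    Assumed convention: t=0 is pure noise, t=1 is data.
    Returns the interpolated point x_t, the time t, and the target velocity.
    """
    x0 = rng.standard_normal(x1.shape)  # noise endpoint
    # One time value per batch element, broadcastable over the remaining axes.
    t = rng.uniform(0.0, 1.0, size=(x1.shape[0],) + (1,) * (x1.ndim - 1))
    x_t = t * x1 + (1.0 - t) * x0       # straight-line interpolation
    target = x1 - x0                    # constant velocity along the line
    return x_t, t, target

def flow_matching_loss(pred_velocity, target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((pred_velocity - target) ** 2))

def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x
```

At generation time, a sampler such as `euler_sample` would start from Gaussian noise and repeatedly query the model for a velocity; the number of steps is a free parameter in this sketch.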

Problem

Research questions and friction points this paper is trying to address.

Unifies image understanding and generation in one model

Integrates autoregressive models with rectified flow

Improves performance via decoupled and aligned encoders

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates autoregressive models with rectified flow

Decouples and aligns understanding and generation encoders

Trains rectified flow within large language model framework
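
One way to picture the encoder alignment listed above is as an auxiliary loss that rewards similar features from the two branches. The sketch below is an assumption for illustration only: this page does not state the form of the alignment term, so cosine similarity is used as a common choice, and the names `alignment_loss`, `total_loss`, and `weight` are hypothetical.

```python
import numpy as np

def alignment_loss(und_features, gen_features, eps=1e-8):
    """Negative mean cosine similarity between two (tokens, dim) feature arrays.

    Assumed form of the alignment term; lower is better, -1 means
    every pair of feature vectors points in the same direction.
    """
    a = und_features / (np.linalg.norm(und_features, axis=-1, keepdims=True) + eps)
    b = gen_features / (np.linalg.norm(gen_features, axis=-1, keepdims=True) + eps)
    return float(-np.mean(np.sum(a * b, axis=-1)))

def total_loss(task_loss, und_features, gen_features, weight=1.0):
    """Add the weighted alignment term to a task loss (e.g. the flow loss)."""
    return task_loss + weight * alignment_loss(und_features, gen_features)
```

In practice the two feature arrays would need matching shapes, which usually means projecting one branch into the other's dimension before comparing them.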

🔎 Similar Papers

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

2024-08-22 · International Conference on Learning Representations · Citations: 292