JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation

📅 2024-11-12
🏛️ arXiv.org
📈 Citations: 10
Influential: 2
🤖 AI Summary
How can a single large model unify both visual understanding and generation? This paper proposes a lightweight unified architecture that, for the first time, directly trains rectified flow end-to-end within a large language model (LLM) framework, without modifying the backbone. It introduces decoupled visual encoders and a cross-task representation alignment mechanism to mitigate semantic conflicts between understanding and generation. A multi-stage joint training strategy then jointly optimizes the autoregressive language modeling and flow matching objectives. Evaluated on MMBench, MME, and COCO Caption benchmarks, the method matches or surpasses task-specific models across all metrics, significantly outperforms existing unified architectures, and achieves state-of-the-art overall performance.

📝 Abstract
We present JanusFlow, a powerful framework that unifies image understanding and generation in a single model. JanusFlow introduces a minimalist architecture that integrates autoregressive language models with rectified flow, a state-of-the-art method in generative modeling. Our key finding demonstrates that rectified flow can be straightforwardly trained within the large language model framework, eliminating the need for complex architectural modifications. To further improve the performance of our unified model, we adopt two key strategies: (i) decoupling the understanding and generation encoders, and (ii) aligning their representations during unified training. Extensive experiments show that JanusFlow achieves comparable or superior performance to specialized models in their respective domains, while significantly outperforming existing unified approaches across standard benchmarks. This work represents a step toward more efficient and versatile vision-language models.
Problem

Research questions and friction points this paper is trying to address.

How can image understanding and generation be unified in a single model?
How can autoregressive language modeling be integrated with rectified flow without complex architectural modifications?
How can semantic conflicts between understanding and generation representations be mitigated?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates autoregressive models with rectified flow
Decouples and aligns understanding and generation encoders
Trains rectified flow within large language model framework
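The joint objective described above pairs next-token prediction for text with a rectified-flow (flow-matching) loss for images: the model learns to predict the constant velocity along the straight path between a noise sample and a data sample. A minimal sketch of that flow-matching term, assuming a generic `predict_velocity` callable as a stand-in for the paper's generation head (all names here are illustrative, not from the paper):

```python
import numpy as np

def rectified_flow_loss(predict_velocity, x0, x1, t):
    """Flow-matching objective for rectified flow.

    x0: noise latent, x1: data (image) latent, t in [0, 1].
    The target velocity along the straight path from x0 to x1
    is the constant (x1 - x0); the model is regressed onto it.
    """
    x_t = (1.0 - t) * x0 + t * x1   # linear interpolation at time t
    target = x1 - x0                # straight-line velocity
    pred = predict_velocity(x_t, t)
    return float(np.mean((pred - target) ** 2))

# Sanity check: an oracle returning the true velocity gives zero loss.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)   # noise latent
x1 = rng.standard_normal(4)   # image latent
oracle = lambda x_t, t: x1 - x0
loss = rectified_flow_loss(oracle, x0, x1, t=0.3)  # → 0.0
```

In training, this term would be summed with the usual autoregressive cross-entropy over text tokens; the straight-line target is what lets rectified flow be trained with a plain regression loss inside an LLM framework.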
👥 Authors
Yiyang Ma (DeepSeek-AI)
Xingchao Liu (DeepSeek-AI)
Xiaokang Chen (DeepSeek-AI)
Wen Liu (DeepSeek-AI)
Chengyue Wu (DeepSeek-AI)
Zhiyu Wu (DeepSeek-AI, Peking University)
Zizheng Pan (DeepSeek-AI)
Zhenda Xie (DeepSeek-AI)
Haowei Zhang (DeepSeek-AI)
Xingkai Yu (DeepSeek-AI)
Liang Zhao (DeepSeek-AI)
Yisong Wang (DeepSeek-AI, Tsinghua University)
Jiaying Liu (Dalian University of Technology)
C. Ruan (DeepSeek-AI)