Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

📅 2025-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the tension between strong multimodal understanding and stable, instruction-faithful image generation within a single model, this paper presents Janus-Pro, an advanced version of Janus that unifies both capabilities in one autoregressive transformer while using decoupled visual encoding for understanding and generation. The method combines an optimized multi-stage training strategy, substantially expanded multimodal and text-to-image training data, and scaling to larger model sizes. Through this coordinated improvement of training recipe, data scale, and model capacity, Janus-Pro delivers significant gains in multimodal understanding and text-to-image instruction following, while also improving the stability of text-to-image generation, with strong results on benchmarks such as MMBench and MME for understanding and GenEval and DPG-Bench for generation.
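To make the unified design described above more concrete, the following is a minimal, hypothetical sketch (not the authors' code): a shared autoregressive transformer consumes text tokens alongside image inputs from decoupled visual pathways, and two output heads predict either text tokens (understanding) or discrete image tokens (generation). All module names, dimensions, and the toy inputs are assumptions for illustration.

```python
# Illustrative sketch of a unified understanding/generation decoder with
# decoupled visual encoding. Sizes and components are assumed, not the paper's.
import torch
import torch.nn as nn

class UnifiedMultimodalLM(nn.Module):
    def __init__(self, d_model=512, n_layers=4, n_heads=8,
                 text_vocab=32000, image_vocab=16384):
        super().__init__()
        # Decoupled visual pathways (stand-ins for the real encoders):
        # a projection of continuous image features for understanding,
        # and a codebook embedding of discrete image tokens for generation.
        self.und_proj = nn.Linear(1024, d_model)                # image features -> LM space
        self.img_tok_emb = nn.Embedding(image_vocab, d_model)   # VQ-style image-token ids
        self.txt_tok_emb = nn.Embedding(text_vocab, d_model)

        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)  # run with a causal mask

        self.text_head = nn.Linear(d_model, text_vocab)    # understanding: next text token
        self.image_head = nn.Linear(d_model, image_vocab)  # generation: next image token

    def forward(self, txt_ids, img_feats=None, img_ids=None):
        parts = [self.txt_tok_emb(txt_ids)]
        if img_feats is not None:   # understanding: prepend projected image features
            parts.insert(0, self.und_proj(img_feats))
        if img_ids is not None:     # generation: append embedded image tokens
            parts.append(self.img_tok_emb(img_ids))
        x = torch.cat(parts, dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.backbone(x, mask=causal)
        return self.text_head(h), self.image_head(h)

# Toy usage: one image (64 patch features) plus a 16-token text prompt.
model = UnifiedMultimodalLM()
txt = torch.randint(0, 32000, (1, 16))
feats = torch.randn(1, 64, 1024)
text_logits, image_logits = model(txt, img_feats=feats)
print(text_logits.shape, image_logits.shape)
```

The sketch only shows why a single backbone with separate visual pathways and separate output heads can serve both tasks; the actual encoders, tokenizers, and training stages are described in the paper itself.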

Technology Category

Computer Vision
Natural Language Processing
Generative Models

Application Category

Multimodal Understanding
Text-to-Image Generation

📝 Abstract
In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.
Problem

Research questions and friction points this paper is trying to address.

Unifying multimodal understanding and text-to-image generation in a single model
Instability of text-to-image generation in the original Janus
Limited training data and model capacity of the previous version
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhanced Training Method
Larger Training Dataset
Scaled-up Model