Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

📅 2025-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the tension between strong multimodal understanding and stable, instruction-faithful image generation within a single model, this paper presents Janus-Pro, an advanced version of Janus that unifies both capabilities in one autoregressive transformer while using decoupled visual encoding for understanding and generation. The method combines an optimized multi-stage training strategy, substantially expanded multimodal and text-to-image training data, and scaling to larger model sizes. Through this coordinated improvement of training recipe, data scale, and model capacity, Janus-Pro delivers significant gains in multimodal understanding and text-to-image instruction following, while also improving the stability of text-to-image generation, with strong results on benchmarks such as MMBench and MME for understanding and GenEval and DPG-Bench for generation.
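To make the unified design described above more concrete, the following is a minimal, hypothetical sketch (not the authors' code): a shared autoregressive transformer consumes text tokens alongside image inputs from decoupled visual pathways, and two output heads predict either text tokens (understanding) or discrete image tokens (generation). All module names, dimensions, and the toy inputs are assumptions for illustration.

```python
# Illustrative sketch of a unified understanding/generation decoder with
# decoupled visual encoding. Sizes and components are assumed, not the paper's.
import torch
import torch.nn as nn

class UnifiedMultimodalLM(nn.Module):
    def __init__(self, d_model=512, n_layers=4, n_heads=8,
                 text_vocab=32000, image_vocab=16384):
        super().__init__()
        # Decoupled visual pathways (stand-ins for the real encoders):
        # a projection of continuous image features for understanding,
        # and a codebook embedding of discrete image tokens for generation.
        self.und_proj = nn.Linear(1024, d_model)                # image features -> LM space
        self.img_tok_emb = nn.Embedding(image_vocab, d_model)   # VQ-style image-token ids
        self.txt_tok_emb = nn.Embedding(text_vocab, d_model)

        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)  # run with a causal mask

        self.text_head = nn.Linear(d_model, text_vocab)    # understanding: next text token
        self.image_head = nn.Linear(d_model, image_vocab)  # generation: next image token

    def forward(self, txt_ids, img_feats=None, img_ids=None):
        parts = [self.txt_tok_emb(txt_ids)]
        if img_feats is not None:   # understanding: prepend projected image features
            parts.insert(0, self.und_proj(img_feats))
        if img_ids is not None:     # generation: append embedded image tokens
            parts.append(self.img_tok_emb(img_ids))
        x = torch.cat(parts, dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.backbone(x, mask=causal)
        return self.text_head(h), self.image_head(h)

# Toy usage: one image (64 patch features) plus a 16-token text prompt.
model = UnifiedMultimodalLM()
txt = torch.randint(0, 32000, (1, 16))
feats = torch.randn(1, 64, 1024)
text_logits, image_logits = model(txt, img_feats=feats)
print(text_logits.shape, image_logits.shape)
```

The sketch only shows why a single backbone with separate visual pathways and separate output heads can serve both tasks; the actual encoders, tokenizers, and training stages are described in the paper itself.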

Technology Category

Computer Vision
Natural Language Processing
Generative Models

Application Category

Multimodal Understanding
Text-to-Image Generation

📝 Abstract
In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.
Problem

Research questions and friction points this paper is trying to address.

Unifying multimodal understanding and text-to-image generation in a single model
Instability of text-to-image generation in the original Janus
Limited training data and model capacity of the previous version
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhanced Training Method
Larger Training Dataset
Scaled-up Model