ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation

📅 2025-06-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the closed nature of leading multimodal generative systems such as GPT-4o-Image and the lack of open-source alternatives with comparable image generation capability, this paper introduces ShareGPT-4o-Image, the first high-fidelity open-source dataset for text-to-image and image-conditioned image synthesis (45K text-to-image + 46K text-and-image-to-image samples), all synthesized by distilling GPT-4o's image generation ability. Leveraging this dataset, the authors develop Janus-4o, a multimodal large language model that unifies both generation modes; it improves text-to-image quality over its predecessor Janus-Pro and learns text-and-image-to-image generation from scratch using only 91K synthetic samples and six hours of training on eight A800 GPUs. The contributions are threefold: (1) demonstrating strong image generation from only 91K synthetic samples; (2) surpassing Janus-Pro in text-to-image generation quality; and (3) unifying both generation modes within a single architecture, advancing open multimodal generative research.

📝 Abstract
Recent advances in multimodal generative models have unlocked photorealistic, instruction-aligned image generation, yet leading systems like GPT-4o-Image remain proprietary and inaccessible. To democratize these capabilities, we present ShareGPT-4o-Image, the first dataset comprising 45K text-to-image and 46K text-and-image-to-image samples, all synthesized with GPT-4o to distill its advanced image generation abilities. Leveraging this dataset, we develop Janus-4o, a multimodal large language model capable of both text-to-image and text-and-image-to-image generation. Janus-4o not only significantly improves text-to-image generation over its predecessor, Janus-Pro, but also newly supports text-and-image-to-image generation. Notably, it achieves impressive performance in text-and-image-to-image generation from scratch, using only 91K synthetic samples and 6 hours of training on an 8×A800-GPU machine. We hope the release of ShareGPT-4o-Image and Janus-4o will foster open research in photorealistic, instruction-aligned image generation.
Problem

Research questions and friction points this paper is trying to address.

Democratizing GPT-4o-level image generation capabilities
Creating an open dataset for multimodal model alignment
Developing an efficient text-and-image-to-image generation model
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthesized a GPT-4o-distilled dataset for image generation
Developed the Janus-4o multimodal large language model
Achieved efficient training with minimal compute resources
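The two sample types in ShareGPT-4o-Image (text-to-image and text-and-image-to-image) can be sketched as simple records mixed into one joint training set; the field names below ("prompt", "source_image", "image") and helper functions are illustrative assumptions, not the released schema:

```python
# Illustrative sketch of the dataset's two sample modes.
# Field names and helpers are assumptions, not the released schema.

def make_t2i_sample(prompt: str, image: str) -> dict:
    """A text-to-image sample: a prompt paired with a GPT-4o-generated image."""
    return {"task": "t2i", "prompt": prompt, "image": image}

def make_ti2i_sample(prompt: str, source_image: str, image: str) -> dict:
    """A text-and-image-to-image sample: an editing instruction plus a
    source image, paired with the edited output image."""
    return {"task": "ti2i", "prompt": prompt,
            "source_image": source_image, "image": image}

# Mixing both modes in a single list mirrors the paper's joint training
# over 45K t2i and 46K ti2i samples (tiny counts here for illustration).
samples = [
    make_t2i_sample("a photorealistic red fox in snow", "fox.png"),
    make_ti2i_sample("make the sky sunset-colored", "city.png", "city_sunset.png"),
]

t2i = [s for s in samples if s["task"] == "t2i"]
ti2i = [s for s in samples if s["task"] == "ti2i"]
```

A unified schema like this lets a single model branch on whether a source image is present, which is how both generation modes can share one architecture.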