OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing unified multimodal image generation and editing models are constrained by the scale, systematicity, and task complexity of their training data. To address this, we introduce OpenGPT-4o-Image, a large-scale, structured, high-quality dataset covering 11 domains and 51 subtasks, comprising 80,000 instruction–image pairs. Its construction rests on a novel hierarchical task taxonomy, the first to incorporate scientifically grounded image generation and concurrent multi-step instruction execution, and an automated data-generation pipeline that integrates GPT-4o with a curated structured resource pool to ensure both diversity and controllability. On the ImgEdit-Bench and GenEval benchmarks, models fine-tuned on the dataset achieve up to 18% improvement in editing performance and 13% in generation performance over prior baselines. These advances significantly enhance the generalization and real-world applicability of multimodal models in complex, heterogeneous scenarios.

📝 Abstract
The performance of unified multimodal models for image generation and editing is fundamentally constrained by the quality and comprehensiveness of their training data. While existing datasets have covered basic tasks like style transfer and simple object manipulation, they often lack the systematic structure and challenging scenarios required for real-world applications. To address this bottleneck, we introduce OpenGPT-4o-Image, a large-scale dataset constructed using a novel methodology that combines hierarchical task taxonomy with automated data generation. Our taxonomy not only includes fundamental capabilities such as text rendering and style control but also introduces highly practical yet challenging categories like scientific imagery for chemistry illustrations and complex instruction editing requiring simultaneous execution of multiple operations. Through an automated pipeline leveraging structured resource pools and GPT-4o, we generate 80k high-quality instruction-image pairs with controlled diversity, covering 11 major domains and 51 subtasks. Extensive experiments show that fine-tuning leading models on our dataset achieves significant performance gains across multiple benchmarks, with improvements of up to 18% on editing tasks (UniWorld-V1 on ImgEdit-Bench) and 13% on generation tasks (Harmon on GenEval). Our work demonstrates that systematic data construction is key to advancing multimodal AI capabilities.
Problem

Research questions and friction points this paper is trying to address.

Addressing limitations in multimodal training data quality
Systematically covering complex real-world image editing scenarios
Enhancing image generation and editing through structured datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical task taxonomy for systematic data structure
Automated pipeline generating 80k instruction-image pairs
Coverage of 11 domains and 51 subtasks, including scientific imagery and complex-instruction editing
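The pipeline idea described above, sampling structured resources from curated pools and composing multi-operation editing instructions for a model such as GPT-4o to realize as instruction–image pairs, can be sketched as follows. All pool contents, names, and the instruction template here are illustrative assumptions, not the paper's actual resources or implementation.

```python
import random

# Hypothetical, simplified resource pools keyed by taxonomy domain.
# The paper's curated pools are far larger and more structured.
RESOURCE_POOLS = {
    "scientific_imagery": ["benzene ring structure", "DNA double helix"],
    "text_rendering": ["neon sign reading 'OPEN'", "handwritten recipe card"],
}

# A small pool of atomic edit operations to combine into one instruction.
EDIT_OPERATIONS = [
    "replace the background with a sunset",
    "add a red arrow pointing at the main subject",
]

def compose_complex_instruction(domain: str, n_ops: int = 2, seed: int = 0) -> str:
    """Sample one resource and n_ops edits, then emit a single
    multi-step instruction (concurrent execution of several operations)."""
    rng = random.Random(seed)  # seeded for controlled, reproducible diversity
    subject = rng.choice(RESOURCE_POOLS[domain])
    ops = rng.sample(EDIT_OPERATIONS, k=min(n_ops, len(EDIT_OPERATIONS)))
    return f"Generate an image of a {subject}, then " + " and ".join(ops) + "."
```

In the full pipeline, an instruction like this would be paired with a generated image and quality-filtered; varying the seed and pools is what yields controlled diversity across the 51 subtasks.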