🤖 AI Summary
Research on text-guided image editing is held back by the scarcity of large-scale, high-quality, openly available editing datasets built from real images. To address this, we introduce Pico-Banana-400K, a dataset of 400K instruction-driven editing samples generated from real photographs in the OpenImages collection. Edits are produced with the Nano-Banana model, organized by a fine-grained taxonomy of edit types, and filtered through automated quality scoring by multimodal large language models (MLLMs) that targets content preservation and instruction faithfulness. Beyond single-turn edits, the dataset provides three specialized subsets: a 72K-example multi-turn collection for sequential editing, a 56K-example preference subset for alignment research and reward model training, and paired long and short instructions for instruction rewriting and summarization. Pico-Banana-400K is positioned as a large-scale, task-rich foundation for training and benchmarking next-generation image editing models.
📝 Abstract
Recent advances in multimodal models have demonstrated remarkable text-guided image editing capabilities, with systems like GPT-4o and Nano-Banana setting new benchmarks. However, the research community's progress remains constrained by the absence of large-scale, high-quality, and openly accessible datasets built from real images. We introduce Pico-Banana-400K, a comprehensive 400K-image dataset for instruction-based image editing. Our dataset is constructed by leveraging Nano-Banana to generate diverse edit pairs from real photographs in the OpenImages collection. What distinguishes Pico-Banana-400K from previous synthetic datasets is our systematic approach to quality and diversity. We employ a fine-grained image editing taxonomy to ensure comprehensive coverage of edit types while maintaining precise content preservation and instruction faithfulness through MLLM-based quality scoring and careful curation. Beyond single-turn editing, Pico-Banana-400K enables research into complex editing scenarios. The dataset includes three specialized subsets: (1) a 72K-example multi-turn collection for studying sequential editing, reasoning, and planning across consecutive modifications; (2) a 56K-example preference subset for alignment research and reward model training; and (3) paired long-short editing instructions for developing instruction rewriting and summarization capabilities. By providing this large-scale, high-quality, and task-rich resource, Pico-Banana-400K establishes a robust foundation for training and benchmarking the next generation of text-guided image editing models.
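The score-then-filter step described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the `EditSample` fields, the two criterion names, the weights, and the 0.7 threshold are hypothetical and are not taken from the paper, and the scorer is a plain callable standing in for an MLLM judge.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical criteria and weights, chosen only for illustration.
WEIGHTS: Dict[str, float] = {
    "instruction_faithfulness": 0.6,
    "content_preservation": 0.4,
}
THRESHOLD = 0.7  # hypothetical acceptance cutoff


@dataclass
class EditSample:
    source_image: str  # ID or path of the real photograph
    edited_image: str  # ID or path of the generated edit
    instruction: str   # text instruction that produced the edit


# A scorer maps a sample to per-criterion scores in [0, 1].
# In a real pipeline this would wrap an MLLM call; here it is a stand-in.
Scorer = Callable[[EditSample], Dict[str, float]]


def overall_score(scores: Dict[str, float]) -> float:
    """Weighted sum of the per-criterion scores."""
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())


def filter_samples(
    samples: List[EditSample],
    scorer: Scorer,
    threshold: float = THRESHOLD,
) -> Tuple[List[EditSample], List[EditSample]]:
    """Split samples into (kept, rejected) by their overall score."""
    kept: List[EditSample] = []
    rejected: List[EditSample] = []
    for sample in samples:
        if overall_score(scorer(sample)) >= threshold:
            kept.append(sample)
        else:
            rejected.append(sample)
    return kept, rejected
```

Returning the rejected samples rather than discarding them is a design choice of this sketch: it keeps low-scoring edits available for inspection or for other uses, without implying how the dataset itself treats them.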