MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset

📅 2026-05-20

🤖 AI Summary
Large-scale text-to-image research is held back by the lack of open datasets that are at once high-quality, deduplicated, diverse, and finely annotated. To address this, the authors distill an initial pool of 2.9 billion raw samples into approximately 104.9 million image–text pairs through a multi-stage pipeline: safety and domain filtering, exact and near-duplicate removal, re-captioning with multiple vision-language models, and augmentation with synthetically generated samples. The resulting dataset combines large scale, open licensing (Apache 2.0), non-redundancy, and rich semantic annotations, which lowers the barrier to entry and improves reproducibility in the field. A 4-billion-parameter latent diffusion model trained exclusively on this dataset reaches competitive scores on the GenEval and DPG benchmarks.
📝 Abstract
Training large text-to-image models requires high-quality, curated datasets with diverse content and detailed captions. Yet the cost and complexity of collecting, filtering, deduplicating, and re-captioning such corpora at scale hinder open and reproducible research in the field. We introduce MONET, an open dataset released under Apache 2.0, containing approximately 104.9M image–text pairs collected from 2.9B raw pairs across heterogeneous open sources. The pairs pass through successive stages of safety filtering, domain-based filtering, exact and near-duplicate removal, and re-captioning with multiple vision-language models covering short to long-form descriptions; the dataset is further augmented with synthetically generated samples. Each image is shipped with pre-computed embeddings and annotations to accelerate downstream use. To validate the effectiveness of MONET, we train a 4B-parameter latent diffusion model exclusively on it and reach competitive GenEval and DPG scores, demonstrating that our dataset lowers the barrier to large-scale, reproducible text-to-image research.
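The abstract names exact and near-duplicate removal as pipeline stages but does not say how they are implemented. The sketch below is one plausible reading, not the authors' method. It assumes exact duplicates are found by hashing raw bytes and near-duplicates by a cosine-similarity threshold over embeddings. The function names, the 0.95 threshold, and the toy samples are all hypothetical. It also computes the retention rate implied by the reported figures (104.9M kept out of 2.9B raw pairs).

```python
import hashlib
import math


def exact_dedup(items):
    """Keep the first item for each distinct SHA-256 digest of the raw bytes."""
    seen, kept = set(), []
    for item_id, payload, embedding in items:
        digest = hashlib.sha256(payload).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append((item_id, payload, embedding))
    return kept


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def near_dedup(items, threshold=0.95):
    """Greedy O(n^2) pass: drop an item that is too similar to one already kept.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    kept = []
    for item in items:
        if all(cosine(item[2], other[2]) < threshold for other in kept):
            kept.append(item)
    return kept


# Toy samples: (id, raw bytes, embedding). All values are made up.
samples = [
    ("a", b"img-1", [1.0, 0.0]),
    ("b", b"img-1", [1.0, 0.0]),    # byte-identical to "a"
    ("c", b"img-2", [0.99, 0.05]),  # near-duplicate of "a" in embedding space
    ("d", b"img-3", [0.0, 1.0]),    # distinct
]

survivors = near_dedup(exact_dedup(samples))
kept_ids = [item_id for item_id, _, _ in survivors]

# Retention implied by the figures reported in the abstract.
retention = 104.9e6 / 2.9e9
print(kept_ids, f"{retention:.1%}")
```

The quadratic comparison is only workable for a toy example. At billions of samples, a pipeline would need an approximate nearest-neighbour index or a hashing scheme instead.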
Problem

Research questions and friction points this paper is trying to address.

text-to-image dataset
data curation
dataset deduplication
reproducible research
large-scale training
Innovation

Methods, ideas, or system contributions that make the work stand out.

text-to-image dataset
deduplication
re-captioning
vision-language models
synthetic data augmentation