FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark

📅 2025-09-11
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Open-source text-to-image models suffer from insufficient reasoning-oriented training data and a lack of comprehensive evaluation benchmarks, hindering the development of advanced reasoning capabilities. To address this, we introduce FLUX-Reason-6M, a large-scale bilingual dataset comprising 6 million images and 20 million English-Chinese captions, organized along six key characteristics: Imagination, Entity, Text rendering, Style, Affection, and Composition. Generation Chain-of-Thought (GCoT) annotations explicitly capture step-by-step image-generation traces. Complementing this, we present PRISM-Bench, a seven-track evaluation framework covering prompt-image alignment, aesthetic quality, and a Long Text challenge built from GCoT. Leveraging the FLUX model and large-scale GPU infrastructure for data generation, and state-of-the-art vision-language models for automated assessment, we conduct systematic evaluations across 19 leading models. Results reveal critical bottlenecks in long-text comprehension and multi-step reasoning. All code, data, and benchmarks are publicly released.

๐Ÿ“ Abstract
The advancement of open-source text-to-image (T2I) models has been hindered by the absence of large-scale, reasoning-focused datasets and comprehensive evaluation benchmarks, resulting in a performance gap compared to leading closed-source systems. To address this challenge, we introduce FLUX-Reason-6M and PRISM-Bench (Precise and Robust Image Synthesis Measurement Benchmark). FLUX-Reason-6M is a massive dataset consisting of 6 million high-quality FLUX-generated images and 20 million bilingual (English and Chinese) descriptions specifically designed to teach complex reasoning. The images are organized according to six key characteristics: Imagination, Entity, Text rendering, Style, Affection, and Composition, with explicit Generation Chain-of-Thought (GCoT) annotations providing detailed breakdowns of image generation steps. The full data curation took 15,000 A100 GPU days, providing the community with a resource previously unattainable outside of large industrial labs. PRISM-Bench offers a novel evaluation standard with seven distinct tracks, including a formidable Long Text challenge using GCoT. Through carefully designed prompts, it utilizes advanced vision-language models for nuanced, human-aligned assessment of prompt-image alignment and image aesthetics. Our extensive evaluation of 19 leading models on PRISM-Bench reveals critical performance gaps and highlights specific areas requiring improvement. Our dataset, benchmark, and evaluation code are released to catalyze the next wave of reasoning-oriented T2I generation. Project page: https://flux-reason-6m.github.io/ .
Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale reasoning datasets for text-to-image models
Absence of comprehensive evaluation benchmarks for T2I systems
Performance gap between open-source and closed-source T2I models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created 6M image dataset with bilingual reasoning descriptions
Designed Generation Chain-of-Thought for step breakdowns
Introduced 7-track benchmark with VLM-based evaluation
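The VLM-based evaluation mentioned above can be pictured as a judge loop: the benchmark composes a scoring instruction for a vision-language model, sends it alongside the generated image, and parses a numeric score from the reply. The sketch below is not the authors' code; the prompt wording, 0-10 scale, and `Score:` reply format are illustrative assumptions, and the model call is stubbed out.

```python
# Hedged sketch of a VLM-as-judge scoring step for prompt-image alignment.
# The judge-prompt template and reply format are assumptions, not the
# paper's actual protocol; a real pipeline would call a VLM API with the
# image attached instead of using the stubbed reply below.
import re


def build_judge_prompt(prompt: str) -> str:
    """Compose an instruction asking a VLM to rate alignment on a 0-10 scale."""
    return (
        "You are an image-evaluation judge. Rate how well the attached image "
        f"matches this prompt on a 0-10 scale: '{prompt}'. "
        "Answer with 'Score: <number>' followed by a one-sentence justification."
    )


def parse_score(vlm_reply: str) -> float:
    """Extract the numeric score from the judge's free-text reply."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", vlm_reply)
    if match is None:
        raise ValueError(f"no score found in reply: {vlm_reply!r}")
    return float(match.group(1))


# Stubbed judge reply standing in for a real VLM response:
reply = "Score: 8 - The rendered text matches, but one entity is missing."
print(parse_score(reply))  # 8.0
```

Averaging such per-image scores over each of the seven tracks would yield a per-track leaderboard of the kind the paper reports for its 19 models.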