Alchemist: Turning Public Text-to-Image Data into Generative Gold

📅 2025-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Public general-purpose text-to-image (T2I) supervised fine-tuning (SFT) datasets are scarce, narrowly scoped, and costly to construct, while leading models rely on opaque, proprietary internal data; this severely hinders open-source T2I development. To address this, we propose a paradigm that leverages pre-trained T2I models themselves as sample-value evaluators, eliminating manual annotation and heuristic rules. Our approach combines joint confidence-diversity scoring, cross-model consistency filtering, and collaborative aesthetic-alignment scoring to identify high-impact samples efficiently. From these, we construct Alchemist, a compact (3,350 samples) yet highly generalizable SFT dataset. Experiments show that fine-tuning on Alchemist substantially improves the generation quality of five leading open-source T2I models while preserving diversity and style. Both the dataset and all corresponding fine-tuned model weights are publicly released.
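
This summary names the scoring components but not their implementation, so below is a minimal sketch of one way a pre-trained diffusion model could serve as a sample-value evaluator: score a candidate image-text pair by the model's denoising error at a fixed noise level. The model ID, the single fixed timestep, and the loss-as-confidence proxy are illustrative assumptions, not the authors' released method.

```python
# Minimal sketch: a pre-trained T2I diffusion model's denoising error as a
# proxy "confidence" score for one image-text pair.
# Assumptions (not from the paper): model ID, fixed timestep, loss-as-score.
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"  # assumed evaluator; any public T2I model
).to(device)

@torch.no_grad()
def sample_score(image: torch.Tensor, prompt: str, timestep: int = 500) -> float:
    """Score a (1, 3, H, W) image tensor in [-1, 1] against its prompt.

    A lower denoising error means the pre-trained model reconstructs the
    pair confidently; scores can then feed a diversity-aware selection step.
    """
    # Encode the image into VAE latents and the prompt into text embeddings.
    latents = pipe.vae.encode(image.to(device)).latent_dist.mean
    latents = latents * pipe.vae.config.scaling_factor
    tokens = pipe.tokenizer(
        prompt, padding="max_length", truncation=True,
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
    )
    text_emb = pipe.text_encoder(tokens.input_ids.to(device))[0]

    # Noise the latents at the chosen timestep and measure the UNet's
    # epsilon-prediction error against the true noise.
    noise = torch.randn_like(latents)
    t = torch.tensor([timestep], device=device)
    noisy = pipe.scheduler.add_noise(latents, noise, t)
    pred = pipe.unet(noisy, t, encoder_hidden_states=text_emb).sample
    return F.mse_loss(pred, noise).item()
```

Averaging this score over several timesteps and noise draws would reduce its variance; the paper's actual confidence signal may differ.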

📝 Abstract
Pre-training equips text-to-image (T2I) models with broad world knowledge, but this alone is often insufficient to achieve high aesthetic quality and alignment. Consequently, supervised fine-tuning (SFT) is crucial for further refinement. However, its effectiveness highly depends on the quality of the fine-tuning dataset. Existing public SFT datasets frequently target narrow domains (e.g., anime or specific art styles), and the creation of high-quality, general-purpose SFT datasets remains a significant challenge. Current curation methods are often costly and struggle to identify truly impactful samples. This challenge is further complicated by the scarcity of public general-purpose datasets, as leading models often rely on large, proprietary, and poorly documented internal data, hindering broader research progress. This paper introduces a novel methodology for creating general-purpose SFT datasets by leveraging a pre-trained generative model as an estimator of high-impact training samples. We apply this methodology to construct and release Alchemist, a compact (3,350 samples) yet highly effective SFT dataset. Experiments demonstrate that Alchemist substantially improves the generative quality of five public T2I models while preserving diversity and style. Additionally, we release the fine-tuned models' weights to the public.
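
One component named in the AI summary above, cross-model consistency filtering, can be illustrated independently of any particular scoring model: keep only samples that several evaluator models all rank highly. The function below is a hypothetical sketch; the score-matrix shape, top-fraction threshold, and intersection rule are assumptions rather than the paper's published procedure.

```python
# Hypothetical sketch of cross-model consistency filtering: retain samples
# that every evaluator model ranks within its own top fraction.
import numpy as np

def consistency_filter(scores: np.ndarray, top_frac: float = 0.1) -> np.ndarray:
    """scores: (n_models, n_samples) array, higher = more valuable.

    Returns indices of samples that appear in the top `top_frac` of every
    model's ranking, i.e. the intersection of per-model top-k sets.
    """
    n_samples = scores.shape[1]
    k = max(1, int(top_frac * n_samples))
    top_sets = [set(np.argsort(-model_scores)[:k]) for model_scores in scores]
    kept = set.intersection(*top_sets)
    return np.array(sorted(kept), dtype=int)
```

With a (3, 100000) score matrix, for instance, only pairs that all three evaluators place in their top 10% survive, which is one way a web-scale candidate pool could be cut down toward a few thousand samples.
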
Problem

Research questions and friction points this paper is trying to address.

Improving text-to-image model quality via supervised fine-tuning
Addressing the scarcity of public, general-purpose SFT datasets
Reducing the cost of identifying high-impact training samples
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses a pre-trained generative model to estimate the training value of candidate samples
Constructs a compact (3,350 samples) yet effective general-purpose SFT dataset
Improves five public T2I models' generation quality while preserving diversity and style (see the fine-tuning sketch after this list)
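
Since both the dataset and the fine-tuned weights are public, a natural use is a standard epsilon-prediction SFT loop over the released samples with a public T2I model. In the sketch below, the Hugging Face dataset ID ("yandex/alchemist"), the column names ("image", "prompt"), the base model, and the hyperparameters are all guesses for illustration, not the paper's training recipe.

```python
# Minimal SFT-loop sketch over the released dataset. Assumptions (not from
# the paper): dataset ID, column names, base model, and hyperparameters.
import torch
import torch.nn.functional as F
from datasets import load_dataset
from torchvision import transforms
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"  # assumed base model
).to(device)
pipe.unet.train()
opt = torch.optim.AdamW(pipe.unet.parameters(), lr=1e-5)

ds = load_dataset("yandex/alchemist", split="train")  # assumed dataset ID
to_tensor = transforms.Compose([
    transforms.Resize(512), transforms.CenterCrop(512),
    transforms.ToTensor(), transforms.Normalize([0.5], [0.5]),
])

for sample in ds:  # single-sample "batches" keep the sketch short
    image = to_tensor(sample["image"].convert("RGB")).unsqueeze(0).to(device)
    tokens = pipe.tokenizer(
        sample["prompt"], padding="max_length", truncation=True,
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
    )
    with torch.no_grad():  # the VAE and text encoder stay frozen
        latents = pipe.vae.encode(image).latent_dist.sample()
        latents = latents * pipe.vae.config.scaling_factor
        text_emb = pipe.text_encoder(tokens.input_ids.to(device))[0]

    # Standard diffusion SFT objective: predict the injected noise.
    noise = torch.randn_like(latents)
    t = torch.randint(0, pipe.scheduler.config.num_train_timesteps, (1,), device=device)
    noisy = pipe.scheduler.add_noise(latents, noise, t)
    pred = pipe.unet(noisy, t, encoder_hidden_states=text_emb).sample
    loss = F.mse_loss(pred, noise)
    loss.backward()
    opt.step()
    opt.zero_grad()
```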