Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-image (T2I) models suffer from low fidelity, training instability, and high computational cost due to low-quality and redundant samples in large-scale training datasets. To address this, we propose the first multimodal (text–image) meta-gradient-driven data selection framework—eliminating reliance on manual annotations or unidimensional heuristic scoring. Our approach employs a lightweight, learnable scorer that models gradient sensitivity across multiple granularities, coupled with a novel Shift-Gsampling subset selection strategy for automatic, scalable, and differentiable data influence estimation. Evaluated on both synthetic and web-crawled datasets, our method significantly improves generation quality and downstream task performance: training on only 50% of curated data surpasses the full-dataset baseline, achieving superior efficiency–effectiveness trade-offs.
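The summary above describes scoring samples by gradient sensitivity. As an illustrative sketch only (not the paper's actual rater), a common meta-gradient proxy scores each training sample by how well its loss gradient aligns with the gradient of a held-out validation loss; the linear model and squared-error loss below are assumptions chosen to keep the example self-contained.

```python
import numpy as np

def influence_scores(w, X_train, y_train, X_val, y_val):
    """Illustrative meta-gradient proxy: score each training sample by
    the dot product between its per-sample loss gradient and the
    validation-loss gradient, for a linear model y = X @ w with
    squared error. A higher score means a gradient step on that
    sample also moves the weights in the validation descent direction."""
    # Validation gradient of mean squared error w.r.t. w.
    val_resid = X_val @ w - y_val
    g_val = 2.0 * X_val.T @ val_resid / len(y_val)

    # Per-sample training gradients: 2 * (x @ w - y) * x.
    train_resid = X_train @ w - y_train             # shape (n,)
    g_train = 2.0 * train_resid[:, None] * X_train  # shape (n, d)

    # Alignment score per sample; negative means the sample pushes
    # the weights away from the validation descent direction.
    return g_train @ g_val
```

A sample whose target agrees with the validation data scores positive, while a conflicting (e.g. mislabeled) sample scores negative, which is the signal a learned rater could distill.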

📝 Abstract
Recent advances in Text-to-Image (T2I) generative models, such as Imagen, Stable Diffusion, and FLUX, have led to remarkable improvements in visual quality. However, their performance is fundamentally limited by the quality of training data. Web-crawled and synthetic image datasets often contain low-quality or redundant samples, which lead to degraded visual fidelity, unstable training, and inefficient computation. Effective data selection is therefore crucial for improving data efficiency. Existing Text-to-Image data filtering approaches rely on costly manual curation or heuristic scoring based on single-dimensional features. Although meta-learning-based methods have been explored for LLMs, they have not been adapted to image modalities. To this end, we propose **Alchemist**, a meta-gradient-based framework that selects a suitable subset from large-scale text-image data pairs. Our approach automatically learns to assess the influence of each sample by iteratively optimizing the model from a data-centric perspective. Alchemist consists of two key stages: data rating and data pruning. We train a lightweight rater to estimate each sample's influence from gradient information, enhanced with multi-granularity perception, and then use the Shift-Gsampling strategy to select informative subsets for efficient model training. Alchemist is the first automatic, scalable, meta-gradient-based data selection framework for Text-to-Image model training. Experiments on both synthetic and web-crawled datasets demonstrate that Alchemist consistently improves visual quality and downstream performance: training on an Alchemist-selected 50% of the data can outperform training on the full dataset.
Problem

Research questions and friction points this paper is trying to address.

Selects high-quality subsets from large-scale text-image data pairs.
Improves training efficiency and visual quality in text-to-image models.
Automates data selection via meta-gradient-based influence assessment.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Meta-gradient framework for data selection
Lightweight rater with multi-granularity perception
Shift-Gsampling strategy for efficient training
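The two-stage design listed above (rate, then prune) can be sketched as follows. The page does not describe how Shift-Gsampling works, so plain top-fraction selection stands in here as a hypothetical placeholder for the pruning stage.

```python
import numpy as np

def prune_dataset(scores, keep_frac=0.5):
    """Pruning-stage sketch: keep the top-scoring fraction of samples.
    The paper's Shift-Gsampling strategy is more involved (details not
    given in this summary); plain top-k selection is a stand-in."""
    scores = np.asarray(scores)
    k = max(1, int(len(scores) * keep_frac))
    # Indices of the k highest-scoring samples, highest first.
    return np.argsort(scores)[::-1][:k]
```

With `keep_frac=0.5` this mirrors the headline result: a model trained only on the retained half of the data, rather than on everything, is what the paper reports outperforming the full-dataset baseline.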
Kaixin Ding
The University of Hong Kong
Yang Zhou
South China University of Technology
Xi Chen
The University of Hong Kong
Miao Yang
Kling Team, Kuaishou Technology
Jiarong Ou
Unknown affiliation
Rui Chen
Kling Team, Kuaishou Technology
Xin Tao
Kuaishou
Computer Vision · Generative AI
Hengshuang Zhao
The University of Hong Kong
Computer Vision · Machine Learning · Artificial Intelligence