Technical Report on Text Dataset Distillation

📅 2025-12-03
🤖 AI Summary
This report surveys dataset distillation for text: compressing a large text corpus into a small synthetic dataset that trains models to comparable performance. It traces the field's evolution from direct adaptations of vision-domain methods into a distinct branch of research, covering milestones such as transformer-based approaches, the generation of discrete synthetic text, and scaling to decoder-only language models with over 1B parameters. The report organizes existing distillation strategies, highlights key contributions, and identifies open challenges, including benchmark standardization, the discrete nature of text, support for complex NLP tasks, and explicit pathways to real-world deployment.

📝 Abstract
In the vision domain, dataset distillation arose as a technique to condense a large dataset into a smaller synthetic one that yields similar results when used for training. While image data enjoys an extensive literature of distillation methods, text dataset distillation has comparatively few works. Text dataset distillation initially grew as an adaptation of efforts from the vision domain; as the particularities of the modality became clear obstacles, it rose into a separate branch of research. Several milestones mark the development of this area, such as the introduction of methods that use transformer models, the generation of discrete synthetic text, and the scaling to decoder-only models with over 1B parameters. Despite major advances in modern approaches, the field remains in a maturing phase, with room for improvement in benchmark standardization, approaches to overcome the discrete nature of text, handling of complex tasks, and explicit examples of real-world applications. In this report, we review past and recent advances in dataset distillation for text, highlighting different distillation strategies, key contributions, and general challenges.
Problem

Research questions and friction points this paper is trying to address.

Text dataset distillation adapts vision-domain methods to condense large text datasets into small synthetic ones.
The discrete nature of text poses obstacles that these methods address with transformer and decoder-only models.
The field still lacks standardized benchmarks and explicit real-world application examples.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapting vision distillation methods to text
Using transformer models for synthetic text generation
Scaling to decoder-only models with 1B+ parameters
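The generation of discrete synthetic text noted above runs into a basic obstacle: picking a token is not differentiable. One standard relaxation from the broader literature (not necessarily the mechanism of any specific method surveyed here) is Gumbel-softmax, which replaces a hard token choice with a temperature-controlled soft one-hot over the vocabulary. A minimal sketch, with a hypothetical vocabulary and sequence length:

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample over the vocabulary.
    As tau decreases, the sample concentrates on a single token while
    remaining differentiable for tau > 0 (here NumPy only illustrates
    the math; in practice this sits inside an autodiff framework)."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0,1) noise
    z = (logits + g) / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

vocab, seq_len = 10, 4
logits = np.random.default_rng(1).normal(size=(seq_len, vocab))  # per-position token scores

# Same Gumbel noise at two temperatures: the argmax token is identical,
# but the low-temperature sample is much closer to a hard one-hot choice.
soft = gumbel_softmax(logits, tau=5.0, rng=np.random.default_rng(2))
hard = gumbel_softmax(logits, tau=0.1, rng=np.random.default_rng(2))
print("max prob per position, tau=5.0:", soft.max(axis=-1).round(2))
print("max prob per position, tau=0.1:", hard.max(axis=-1).round(2))
```

Annealing the temperature during optimization lets a synthesizer train through the relaxation and still emit (near-)discrete tokens at the end, which is one way the discreteness obstacle described in this report can be attacked.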
K. Ogawa
Escola Politécnica, Universidade de São Paulo, São Paulo, Brazil
Bruno Yamamoto
Escola Politécnica, Universidade de São Paulo, São Paulo, Brazil
Lucas Lauton de Alcantara
Escola Politécnica, Universidade de São Paulo, São Paulo, Brazil
Victor Zacarias
Instituto de Matemática e Estatística, Universidade de São Paulo, São Paulo, Brazil
Edson Bollis
Instituto de Ciência e Tecnologia Itaú (ICTi), São Paulo, Brazil
Lucas Pellicer
Instituto de Ciência e Tecnologia Itaú (ICTi), São Paulo, Brazil
Rosimeire Pereira Costa
Instituto de Ciência e Tecnologia Itaú (ICTi), São Paulo, Brazil
Anna Helena Reali Costa
Full Professor of Computer Engineering, Universidade de São Paulo
Artur Jordão
Escola Politécnica, Universidade de São Paulo, São Paulo, Brazil