Transferable text data distillation by trajectory matching

📅 2025-04-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high redundancy and exorbitant cost of training data for large language models (LLMs), this paper introduces, for the first time, data distillation into NLP text generation tasks—specifically instruction tuning—and proposes the first instruction-tuning-oriented data distillation framework. Methodologically, it learns pseudo-prompts via trajectory matching and synthesizes high-information samples through nearest-neighbor retrieval coupled with regularization losses; it further supports cross-architecture transfer (e.g., from OPT to Llama). Experiments demonstrate significant improvements over the state-of-the-art data selection method LESS on ARC-Easy and MMLU, validating the framework’s effectiveness, robustness, and transferability. By overcoming the fundamental limitation of discrete text in conventional distillation, this work establishes a novel paradigm for efficient LLM training.

📝 Abstract
In the realm of large language models (LLMs), growing model sizes bring correspondingly higher training costs, so there is an urgent need to minimize the data size used in LLM training. Compared with data selection methods, data distillation aims to synthesize a small number of data samples that achieve the training effect of the full dataset, and it offers greater flexibility. Despite its successes in computer vision, the discreteness of text data has hitherto stymied its exploration in natural language processing (NLP). In this work, we propose a method that learns pseudo prompt data via trajectory matching and maps it to its nearest-neighbor token IDs to achieve cross-architecture transfer. During the distillation process, we introduce a regularization loss to improve the robustness of our distilled data. To the best of our knowledge, this is the first data distillation work suitable for text generation tasks such as instruction tuning. Evaluations on two benchmarks, the ARC-Easy and MMLU instruction tuning datasets, establish the superiority of our distillation approach over the SOTA data selection method LESS. Furthermore, our method demonstrates good transferability across LLM architectures (i.e., OPT to Llama).
Problem

Research questions and friction points this paper is trying to address.

Minimize LLM training costs via data distillation
Overcome text discreteness for NLP data distillation
Enable cross-architecture transfer in text generation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trajectory matching for pseudo prompt learning
Nearest neighbor ID for cross-architecture transfer
Regularization loss enhances distilled data robustness
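The three ideas above can be illustrated in a toy setting. The sketch below is our own hedged approximation, not the paper's implementation: a linear "student" stands in for the LLM, continuous vectors stand in for pseudo-prompt embeddings, and all function and variable names are ours. It shows (a) an MTT-style trajectory-matching loss that compares the student's endpoint after training on synthetic data against an expert checkpoint, and (b) nearest-neighbor projection of continuous pseudo-embeddings onto discrete vocabulary IDs, which is what enables transfer to a different architecture.

```python
import numpy as np

def unroll_student(theta0, synth_X, synth_y, lr, steps):
    # Train a linear least-squares "student" on the synthetic set
    # for a few SGD steps, starting from the expert's checkpoint theta0.
    theta = theta0.copy()
    for _ in range(steps):
        grad = synth_X.T @ (synth_X @ theta - synth_y) / len(synth_y)
        theta = theta - lr * grad
    return theta

def trajectory_matching_loss(theta0, theta_target, synth_X, synth_y, lr, steps):
    # MTT-style objective: distance between the unrolled student's endpoint
    # and the later expert checkpoint, normalized by the segment length so
    # a loss below 1.0 means the synthetic data moved the student closer.
    theta_end = unroll_student(theta0, synth_X, synth_y, lr, steps)
    num = np.sum((theta_end - theta_target) ** 2)
    den = np.sum((theta0 - theta_target) ** 2) + 1e-8
    return num / den

def nearest_neighbor_ids(pseudo_embeds, vocab_embeds):
    # Project each continuous pseudo-token embedding onto its nearest
    # vocabulary row, yielding discrete token IDs that any architecture
    # sharing the vocabulary can consume.
    d2 = ((pseudo_embeds[:, None, :] - vocab_embeds[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

# Toy demonstration with hypothetical dimensions and data.
rng = np.random.default_rng(0)
theta_star = rng.normal(size=4)            # "expert" solution
theta0 = rng.normal(size=4)                # earlier checkpoint
synth_X = rng.normal(size=(16, 4))         # learnable synthetic inputs
synth_y = synth_X @ theta_star             # labels consistent with the expert
loss = trajectory_matching_loss(theta0, theta_star, synth_X, synth_y,
                                lr=0.1, steps=10)

vocab = np.eye(3)                          # tiny stand-in vocabulary table
ids = nearest_neighbor_ids(np.array([[0.9, 0.1, 0.0],
                                     [0.0, 0.2, 0.8]]), vocab)
```

In the full method, the synthetic inputs themselves would be optimized to minimize the trajectory-matching loss (with an added regularization term for robustness), and the nearest-neighbor projection is what discretizes the result; the toy above only evaluates both pieces once.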
Authors
Rong Yao, Huawei Noah's Ark Lab
Hailin Hu, Huawei Noah's Ark Lab
Yifei Fu, Huawei Noah's Ark Lab
Hanting Chen, Huawei Noah's Ark Lab
Wenyi Fang, Huawei Noah's Ark Lab
Fanyi Du, Huawei Noah's Ark Lab
Kai Han, Huawei Noah's Ark Lab
Yunhe Wang, Huawei Noah's Ark Lab