LM-mixup: Text Data Augmentation via Language Model based Mixup

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: High-quality instruction data is scarce, while low-quality or redundant data is typically discarded, leading to substantial information loss. Method: This paper introduces “instruction distillation”, a novel task that leverages language model–driven Mixup to transform low-quality instruction data into high-quality, semantically coherent, and format-compliant instruction-output pairs. Based on this approach, the authors construct MIXTURE, a dataset of 144K distilled samples, and propose a triple-reward mechanism integrating quality assessment, semantic alignment, and format compliance. They then apply supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization (GRPO). Contribution/Results: Fine-tuning with only ~3% of the distilled data surpasses full-scale training on the original dataset and matches state-of-the-art data filtering methods across multiple instruction-following benchmarks, significantly reducing reliance on scarce high-quality instruction data.
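The summary describes reinforcement learning with GRPO, which scores a group of sampled completions and standardizes each reward against the group's own statistics. A minimal sketch of that group-relative advantage step (the function name and plain-Python implementation are illustrative, not from the paper):

```python
def group_relative_advantages(rewards):
    """GRPO-style advantage: standardize each sampled completion's
    reward against the mean and standard deviation of its group,
    so no separate value/critic model is needed."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0.0:  # all completions scored equally: no learning signal
        return [0.0] * n
    return [(r - mean) / std for r in rewards]
```

Completions rewarded above their group's mean get positive advantages and are reinforced; those below are suppressed.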

📝 Abstract
Instruction tuning is crucial for aligning Large Language Models (LLMs), yet the quality of instruction-following data varies significantly. While high-quality data is paramount, it is often scarce; conversely, abundant low-quality data is frequently discarded, leading to substantial information loss. Existing data augmentation methods struggle to augment this low-quality data effectively, and the evaluation of such techniques remains poorly defined. To address this, we formally define the task of Instruction Distillation: distilling multiple low-quality and redundant inputs into high-quality and coherent instruction-output pairs. Specifically, we introduce a comprehensive data construction pipeline to create MIXTURE, a 144K-sample dataset pairing low-quality or semantically redundant imperfect instruction clusters with their high-quality distillations. We then introduce LM-Mixup, trained by first performing supervised fine-tuning on MIXTURE and then optimized with reinforcement learning. This process uses three complementary reward signals, quality, semantic alignment, and format compliance, via Group Relative Policy Optimization (GRPO). We demonstrate that LM-Mixup effectively augments imperfect datasets: fine-tuning LLMs on its distilled data, which accounts for only about 3% of the entire dataset, not only surpasses full-dataset training but also competes with state-of-the-art high-quality data selection methods across multiple benchmarks. Our work establishes that low-quality data is a valuable resource when properly distilled and augmented with LM-Mixup, significantly enhancing the efficiency and performance of instruction-tuned LLMs.
Problem

Research questions and friction points this paper is trying to address.

Distilling low-quality instruction data into high-quality pairs
Augmenting scarce high-quality data with abundant low-quality data
Improving instruction tuning efficiency through data distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

LM-Mixup distills low-quality data into high-quality pairs
Uses supervised fine-tuning and reinforcement learning optimization
Employs three reward signals for quality and alignment
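The three reward signals above (quality, semantic alignment, format compliance) must be combined into one scalar per sampled distillation before policy optimization. A minimal sketch of such an aggregation, assuming a weighted sum; the paper's exact weighting and scoring functions are not specified here, so the names and defaults below are illustrative:

```python
def triple_reward(quality, alignment, format_ok, weights=(1.0, 1.0, 1.0)):
    """Combine the three reward signals into one scalar:
    - quality: scalar score of the distilled pair (e.g. from a judge model)
    - alignment: semantic-alignment score with the source cluster
    - format_ok: boolean format-compliance check
    The weighted-sum aggregation is an assumption, not the paper's exact rule."""
    w_q, w_a, w_f = weights
    return w_q * quality + w_a * alignment + w_f * (1.0 if format_ok else 0.0)
```

A distillation that is fluent but drifts from the source cluster, or that breaks the required instruction-output format, is penalized relative to one that scores well on all three signals.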
Zhijie Deng
The Hong Kong University of Science and Technology (Guangzhou)
Zhouan Shen
The Hong Kong University of Science and Technology (Guangzhou)
Ling Li
The Hong Kong University of Science and Technology (Guangzhou)
Yao Zhou
The Hong Kong University of Science and Technology (Guangzhou)
Zhaowei Zhu
Docta.ai; University of California, Santa Cruz
Machine Learning · Data Quality · Label Noise · Responsible AI
Yanji He
The Hong Kong University of Science and Technology (Guangzhou)
Wei Wang
The Hong Kong University of Science and Technology (Guangzhou)
Jiaheng Wei
The Hong Kong University of Science and Technology (Guangzhou)