On Representation Redundancy in Large-Scale Instruction Tuning Data Selection

📅 2026-02-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the high redundancy in semantic representations generated by large language model encoders within massive instruction-tuning datasets, which limits the efficiency and effectiveness of existing representation-based data selection methods. The work is the first to identify this issue and proposes a Compressed Representation Data Selection (CRDS) framework, introducing two novel approaches: CRDS-R, which combines Rademacher random projection with concatenated Transformer hidden layers, and CRDS-W, which leverages whitening-based dimensionality reduction. Experimental results demonstrate that CRDS-W achieves superior performance using only 3.5% of the full dataset, outperforming the full-data baseline by an average of 0.71% and significantly surpassing current state-of-the-art methods.
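The summary's first variant, CRDS-R, relies on Rademacher random projection, which compresses high-dimensional embeddings by multiplying them with a random ±1 sign matrix. Below is a minimal NumPy sketch of that projection step alone; the function name and the scaling convention (1/√k, standard for Johnson–Lindenstrauss-style projections) are illustrative assumptions, not taken from the paper's codebase.

```python
import numpy as np

def rademacher_projection(X: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Project row embeddings X (n, d) down to (n, k) using a
    Rademacher matrix: i.i.d. entries drawn uniformly from {-1, +1},
    scaled by 1/sqrt(k) so pairwise distances are preserved in
    expectation (Johnson-Lindenstrauss-style guarantee)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Random sign matrix, one column per target dimension.
    R = rng.choice([-1.0, 1.0], size=(d, k)) / np.sqrt(k)
    return X @ R
```

In the CRDS-R setting described above, the input `X` would be the concatenation of several Transformer hidden-layer representations per instruction example; here it is treated as an opaque (n, d) matrix.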

📝 Abstract
Data quality is a crucial factor in training large language models. While prior work has shown that models trained on smaller, high-quality datasets can outperform those trained on much larger but noisy or low-quality corpora, systematic methods for industrial-scale data selection in instruction tuning remain underexplored. In this work, we study instruction-tuning data selection through the lens of semantic representation similarity and identify a key limitation of state-of-the-art LLM encoders: they produce highly redundant semantic embeddings. To mitigate this redundancy, we propose Compressed Representation Data Selection (CRDS), a novel framework with two variants. CRDS-R applies Rademacher random projection followed by concatenation of Transformer hidden-layer representations, while CRDS-W employs whitening-based dimensionality reduction to improve representational quality. Experimental results demonstrate that both variants substantially enhance data quality and consistently outperform state-of-the-art representation-based selection methods. Notably, CRDS-W achieves strong performance using only 3.5% of the data, surpassing the full-data baseline by an average of 0.71% across four datasets. Our code is available at https://github.com/tdano1/CRDS.
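The second variant, CRDS-W, uses whitening-based dimensionality reduction: center the embeddings, diagonalize their covariance, and rescale each principal direction to unit variance so that no single direction dominates similarity computations. The sketch below is a generic PCA-whitening transform in NumPy, offered only as an assumed reading of "whitening-based dimensionality reduction"; the function name and the choice of SVD-based whitening are illustrative, and the paper's actual implementation may differ.

```python
import numpy as np

def whiten(X: np.ndarray, k: int) -> np.ndarray:
    """PCA-whiten row embeddings X (n, d) and keep the top-k
    directions, returning a (n, k) matrix whose empirical
    covariance is (approximately) the identity."""
    mu = X.mean(axis=0, keepdims=True)
    Xc = X - mu                          # center the embeddings
    cov = Xc.T @ Xc / len(X)             # (d, d) empirical covariance
    U, S, _ = np.linalg.svd(cov)         # cov = U diag(S) U^T
    # Whitening kernel: rotate onto principal axes, rescale each
    # axis by 1/sqrt(eigenvalue), truncate to the top-k directions.
    W = U[:, :k] / np.sqrt(S[:k])
    return Xc @ W
```

After this transform, cosine or Euclidean similarity between examples is no longer skewed by the few high-variance directions that make raw LLM encoder embeddings look redundant, which is the motivation the abstract gives for the whitening variant.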
Problem

Research questions and friction points this paper is trying to address.

instruction tuning
data selection
representation redundancy
large language models
semantic similarity
Innovation

Methods, ideas, or system contributions that make the work stand out.

instruction tuning
representation redundancy
data selection
dimensionality reduction
semantic embeddings