GradientSpace: Unsupervised Data Clustering for Improved Instruction Tuning

📅 2025-12-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
In instruction tuning, heterogeneous real-world data induces gradient interference: conflicting gradient directions that degrade model performance. Method: This paper proposes an unsupervised clustering method that operates in full-dimensional gradient space, the first to avoid the accuracy loss incurred by dimensionality reduction. Leveraging LoRA-based gradients, it employs an efficient online SVD algorithm to uncover latent skill structure in the data. At inference, a lightweight router selects a single specialized expert, replacing the expert-ensemble inference of prior methods while preserving specialization and substantially reducing latency. Crucially, the approach requires no semantic priors, embedding knowledge, or hand-crafted ensembles. Results: Evaluated on mathematical reasoning, code generation, finance, and creative writing tasks, the method consistently outperforms existing clustering- and fine-tuning-based baselines, achieving accuracy gains and significant latency reductions across all benchmarks.

📝 Abstract
Instruction tuning is one of the key steps required for adapting large language models (LLMs) to a broad spectrum of downstream applications. However, this procedure is difficult because real-world datasets are rarely homogeneous; they consist of a mixture of diverse information, causing gradient interference, where conflicting gradients pull the model in opposing directions, degrading performance. A common strategy to mitigate this issue is to group data based on semantic or embedding similarity. However, this fails to capture how data influences model parameters during learning. While recent works have attempted to cluster gradients directly, they randomly project gradients into lower dimensions to manage memory, which leads to accuracy loss. Moreover, these methods rely on expert ensembles which necessitates multiple inference passes and expensive on-the-fly gradient computations during inference. To address these limitations, we propose GradientSpace, a framework that clusters samples directly in full-dimensional gradient space. We introduce an online SVD-based algorithm that operates on LoRA gradients to identify latent skills without the infeasible cost of storing all sample gradients. Each cluster is used to train a specialized LoRA expert along with a lightweight router trained to select the best expert during inference. We show that routing to a single, appropriate expert outperforms expert ensembles used in prior work, while significantly reducing inference latency. Our experiments across mathematical reasoning, code generation, finance, and creative writing tasks demonstrate that GradientSpace leads to coherent expert specialization and consistent accuracy gains over state-of-the-art clustering methods and finetuning techniques.
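The abstract describes two core computational ideas: maintaining a low-rank basis of per-sample LoRA gradients with an online SVD (so all gradients never need to be stored at once), and clustering samples in the resulting gradient space. The sketch below illustrates that general pattern with a sequential truncated SVD and a plain k-means step; the function names, shapes, and the simulated gradient batches are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def online_svd_basis(grad_stream, rank, d):
    """Maintain a rank-`rank` left-singular basis of the sample-gradient
    matrix without storing all gradients (sequential truncated SVD)."""
    U = np.zeros((d, 0))  # current basis, one column per retained direction
    S = np.zeros(0)       # corresponding singular values
    for batch in grad_stream:           # batch: (d, b) LoRA-gradient columns
        stacked = np.hstack([U * S, batch])  # fold scaled old basis with new data
        U_new, S_new, _ = np.linalg.svd(stacked, full_matrices=False)
        U, S = U_new[:, :rank], S_new[:rank]  # truncate back to `rank`
    return U

def cluster_in_gradient_space(grads, U, k, iters=50, seed=0):
    """Project gradients onto the learned basis, then k-means in that space."""
    rng = np.random.default_rng(seed)
    Z = grads.T @ U                     # (n_samples, rank) coordinates
    centers = Z[rng.choice(len(Z), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((Z[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = Z[labels == j].mean(0)
    return labels

# Toy usage with random stand-ins for per-sample LoRA gradients.
rng = np.random.default_rng(1)
d, n, rank, k = 16, 40, 4, 2
grads = rng.normal(size=(d, n))
U = online_svd_basis([grads[:, :20], grads[:, 20:]], rank=rank, d=d)
labels = cluster_in_gradient_space(grads, U, k=k)
```

Each resulting cluster would then be used to train its own LoRA expert, per the abstract.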
Problem

Research questions and friction points this paper is trying to address.

Heterogeneous instruction-tuning data causes gradient interference: conflicting gradients pull the model in opposing directions and degrade performance
Grouping data by semantic or embedding similarity fails to capture how samples actually influence model parameters during learning
Prior gradient-clustering methods randomly project gradients to lower dimensions (losing accuracy) and rely on expert ensembles requiring multiple inference passes and on-the-fly gradient computation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Clusters samples directly in full-dimensional gradient space, avoiding lossy random projection
Uses an online SVD over LoRA gradients to identify latent skills without storing all sample gradients
Trains specialized LoRA experts with a lightweight router that selects a single expert at inference, cutting latency versus ensembles
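The routing idea above amounts to one cheap classification step followed by a single expert's forward pass, instead of running every expert and ensembling. A minimal sketch, assuming a linear router and the standard LoRA update form W0 x + B A x; all names and shapes are hypothetical:

```python
import numpy as np

def lora_forward(x, W0, lora_experts, router_W):
    """Route to a single LoRA expert: pick the router's argmax, then apply
    the base weight plus only that expert's low-rank update."""
    e = int(np.argmax(router_W @ x))   # one routing decision, no ensemble
    A, B = lora_experts[e]             # A: (r, d_in), B: (d_out, r)
    return W0 @ x + B @ (A @ x), e

# Toy usage with random weights standing in for trained experts.
rng = np.random.default_rng(0)
d_in, d_out, r, n_exp = 8, 6, 2, 3
W0 = rng.normal(size=(d_out, d_in))
experts = [(rng.normal(size=(r, d_in)), rng.normal(size=(d_out, r)))
           for _ in range(n_exp)]
router_W = rng.normal(size=(n_exp, d_in))
x = rng.normal(size=d_in)
y, chosen = lora_forward(x, W0, experts, router_W)
```

Compared with an ensemble, cost scales with one expert regardless of how many exist, which is the source of the latency reduction the paper reports.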
Shrihari Sridharan
School of Electrical and Computer Engineering, Purdue University
Deepak Ravikumar
Amazon Science
Deep Learning · Out-of-Distribution Detection · Memorization · Trustworthy ML
Anand Raghunathan
School of Electrical and Computer Engineering, Purdue University
Kaushik Roy
School of Electrical and Computer Engineering, Purdue University