Gradient Transformer: Learning to Generate Updates for LLMs

📅 2026-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work targets organizations that lack the compute to fine-tune large language models (LLMs) on private data and for which fine-tuning a small model alone performs poorly. The authors propose a data-free knowledge distillation framework built around a Gradient Transformer, which maps the update vector of a small model fine-tuned on private data to a corresponding update vector for a large model, so that a third-party provider can adapt the LLM without accessing the private data. The framework supports collaborative updates across multiple organizations and is evaluated under differential privacy protection. On language modeling and reasoning tasks, the authors report that the method outperforms state-of-the-art knowledge distillation baselines, even under strict differential privacy.
📝 Abstract
Many organizations lack the computational resources to fine-tune large language models (LLMs) on private (unshareable) data for better utility, while fine-tuning tiny language models (TinyLMs) alone performs poorly. To address this bottleneck, we propose a data-free knowledge distillation framework that generates LLM update vectors based on TinyLMs fine-tuned on private data. An update vector is the vector of parameter changes from an initial model to its version fine-tuned on a dataset, capturing the cumulative effect of the gradient steps taken during fine-tuning. The key idea of our framework is a novel Gradient Transformer (Grad-Transformer) that transforms a TinyLM's update vectors into an LLM's update vectors. Derived from shadow datasets, Grad-Transformer captures the correlation between TinyLM and LLM update vectors, enabling third-party providers to generate LLM update vectors from an organization's TinyLM update vectors without accessing the organization's private data. The framework supports multi-organization collaboration to jointly update LLMs, improving performance and cost-efficiency. Extensive experiments across language modeling and reasoning tasks show that Grad-Transformer remarkably outperforms state-of-the-art knowledge distillation baselines, even under strict differential privacy protection.
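The update-vector definition in the abstract can be made concrete with a minimal Python sketch. It only illustrates the data flow: compute a TinyLM update vector, map it to an LLM-sized update vector, and add the result to the LLM's initial parameters. Everything named here is an assumption for illustration: the `Params` layout, the helpers `update_vector`, `apply_update`, and `toy_linear_map`, and the toy numbers. In particular, `toy_linear_map` is a fixed matrix standing in for the learned mapping; it is not the Grad-Transformer architecture, which the abstract does not specify.

```python
from typing import Dict, List

# Hypothetical parameter layout: tensor name -> flat list of values.
Params = Dict[str, List[float]]


def update_vector(initial: Params, fine_tuned: Params) -> List[float]:
    """Flatten (fine_tuned - initial) into one vector, in sorted-name order."""
    vec: List[float] = []
    for name in sorted(initial):
        vec.extend(f - i for i, f in zip(initial[name], fine_tuned[name]))
    return vec


def apply_update(initial: Params, vec: List[float]) -> Params:
    """Add a flat update vector back onto parameters (inverse of update_vector)."""
    out: Params = {}
    pos = 0
    for name in sorted(initial):
        n = len(initial[name])
        out[name] = [p + d for p, d in zip(initial[name], vec[pos:pos + n])]
        pos += n
    return out


def toy_linear_map(weights: List[List[float]], tiny_vec: List[float]) -> List[float]:
    """Placeholder for the learned mapping: a plain matrix-vector product.

    Assumption: `weights` has one row per LLM parameter and one column per
    TinyLM parameter. The real mapping is learned from shadow datasets.
    """
    return [sum(w * x for w, x in zip(row, tiny_vec)) for row in weights]


# Toy TinyLM with three parameters, before and after private fine-tuning.
tiny_init = {"w": [1.0, 2.0], "b": [0.5]}
tiny_tuned = {"w": [1.5, 1.0], "b": [0.75]}
tiny_update = update_vector(tiny_init, tiny_tuned)  # order: b, then w

# Toy LLM with two parameters and a made-up 2x3 mapping matrix.
llm_init = {"w": [4.0, 8.0]}
mapping = [[2.0, 0.0, 0.0],
           [0.0, 1.0, 1.0]]
llm_update = toy_linear_map(mapping, tiny_update)
llm_tuned = apply_update(llm_init, llm_update)

print(tiny_update, llm_update, llm_tuned)
```

Only `tiny_update` depends on the private data in this sketch, which mirrors the abstract's claim that the provider generates the LLM update from the organization's TinyLM update vector rather than from the data itself.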
Problem

Research questions and friction points this paper is trying to address.

large language models
private data
fine-tuning
knowledge distillation
parameter updates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gradient Transformer
update vector
knowledge distillation
private data
differential privacy