Practical offloading for fine-tuning LLM on commodity GPU via learned sparse projectors

📅 2024-06-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address GPU memory bottlenecks in large language model (LLM) fine-tuning, this paper proposes LSP-Offload, a CPU-GPU offloading framework built on two ideas: sparse projectors learned from data, which compress CPU-GPU traffic with minimal precision loss, and a layer-wise communication schedule that maximizes the overlap between communication and computation. Together they enable near-native fine-tuning speed on commodity hardware. Experiments demonstrate fine-tuning of a 1.3B-parameter model on a 4GB laptop GPU and a 6.7B-parameter model on a 24GB NVIDIA RTX 4090. Compared with state-of-the-art offloading frameworks, end-to-end fine-tuning time falls by 33.1% to 62.5% when converging to the same accuracy.

📝 Abstract

Fine-tuning large language models (LLMs) requires significant memory, often exceeding the capacity of a single GPU. A common solution to this memory challenge is offloading compute and data from the GPU to the CPU. However, this approach is hampered by the limited bandwidth of commodity hardware, which constrains communication between the CPU and GPU, and by slower matrix multiplications on the CPU. In this paper, we present an offloading framework, LSP-Offload, that enables near-native speed LLM fine-tuning on commodity hardware through learned sparse projectors. Our data-driven approach involves learning efficient sparse compressors that minimize communication with minimal precision loss. Additionally, we introduce a novel layer-wise communication schedule to maximize parallelism between communication and computation. As a result, our framework can fine-tune a 1.3 billion parameter model on a 4GB laptop GPU and a 6.7 billion parameter model on a 24GB NVIDIA RTX 4090 GPU. Compared to state-of-the-art offloading frameworks, our approach reduces end-to-end fine-tuning time by 33.1%-62.5% when converging to the same accuracy. We open source our framework at https://github.com/gulang2019/LSP-Offload.
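The compression idea in the abstract can be illustrated with a small sketch. It assumes the gradient of one weight matrix is projected from both sides by sparse matrices P and Q before crossing the CPU-GPU link, and that the small result is projected back afterwards. The names `random_sparse_projector`, `compress`, and `decompress` are hypothetical, and the projectors here are random placeholders stored as dense NumPy arrays for simplicity; the paper learns its projectors from data, which this sketch does not attempt.

```python
import numpy as np

def random_sparse_projector(rows, cols, nnz_per_row, rng):
    """Hypothetical helper: rows x cols matrix with nnz_per_row nonzeros per row.

    Random placeholder only. LSP-Offload learns its projectors from data.
    """
    P = np.zeros((rows, cols))
    for i in range(rows):
        idx = rng.choice(cols, size=nnz_per_row, replace=False)
        P[i, idx] = rng.standard_normal(nnz_per_row) / np.sqrt(nnz_per_row)
    return P

def compress(grad, P, Q):
    # GPU side: shrink the m x n gradient to d x d before sending it to the CPU.
    return P.T @ grad @ Q

def decompress(small, P, Q):
    # GPU side: map the d x d matrix received from the CPU back to m x n.
    return P @ small @ Q.T

rng = np.random.default_rng(0)
m, n, d = 512, 256, 32              # assumed sizes, chosen for illustration
P = random_sparse_projector(m, d, 2, rng)
Q = random_sparse_projector(n, d, 2, rng)

grad = rng.standard_normal((m, n))  # stand-in for a real layer gradient
small = compress(grad, P, Q)        # this is what would cross the link
update = decompress(small, P, Q)    # same shape as the weight matrix

ratio = grad.size / small.size      # reduction in values transferred
print(small.shape, update.shape, ratio)
```

With these assumed sizes the link carries a 32 x 32 matrix instead of a 512 x 256 one. Because each projector row has only a few nonzeros, a real implementation could apply it with sparse kernels instead of dense matrix products.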

Problem

Research questions and friction points this paper is trying to address.

Memory-efficient LLM fine-tuning on GPUs

Reducing CPU-GPU communication bottlenecks

Enabling large models on commodity hardware
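A back-of-the-envelope calculation shows why the memory problem listed above arises. The sketch assumes mixed-precision training with the Adam optimizer, a common setup that costs about 16 bytes per parameter; the summary does not state the optimizer or precision used in the paper, so the byte counts are assumptions, and activation memory is ignored entirely.

```python
# Assumed per-parameter cost of mixed-precision Adam training (not from the paper).
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam momentum": 4,
    "fp32 Adam variance": 4,
}

def training_state_gib(num_params):
    """Memory for weights, gradients, and optimizer state, in GiB."""
    total_bytes = num_params * sum(BYTES_PER_PARAM.values())
    return total_bytes / 2**30

# Model sizes and GPU capacities are the two configurations named in the abstract.
for name, params, gpu_gb in [("1.3B", 1.3e9, 4), ("6.7B", 6.7e9, 24)]:
    need = training_state_gib(params)
    print(f"{name}: ~{need:.1f} GiB of training state vs a {gpu_gb}GB GPU")
```

Under these assumptions the training state alone comes to roughly 19 GiB for the 1.3B model and roughly 100 GiB for the 6.7B model, several times the capacity of the GPUs named in the abstract, which is why part of the state has to live in CPU memory.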

Innovation

Methods, ideas, or system contributions that make the work stand out.

Learned sparse projectors

Layer-wise communication schedule

Efficient sparse compressors

🔎 Similar Papers

Sparse Matrix in Large Language Model Fine-tuning