🤖 AI Summary
To address the challenge of fine-tuning 100B-scale large language models (LLMs) on consumer-grade hardware—e.g., a single RTX 4090 GPU in a server with 256 GB of main memory—this work proposes a framework built around holistic offloading with traffic-aware optimization. The method jointly orchestrates active gradient offloading and traffic-aware activation swapping, integrated with CPU-GPU cooperative memory management and gradient lifecycle optimization. It achieves, for the first time, full-parameter fine-tuning of a 175B-parameter LLM on a single RTX 4090. The approach attains 2.32× the throughput of state-of-the-art baselines when fine-tuning a 13B model and delivers higher cost-effectiveness for 175B models than an NVIDIA DGX-A100 cluster. By overcoming the memory bottlenecks inherent in single-node systems, the framework establishes a scalable, low-cost fine-tuning paradigm for massive LLMs.
📝 Abstract
AI researchers are increasingly interested in fine-tuning pre-trained LLMs, whose sizes have grown beyond 100B parameters, for their downstream tasks. One approach to fine-tuning such huge models is to aggregate device memory across many GPUs. However, this approach introduces prohibitive costs for most data scientists, who have limited budgets for high-end GPU servers. In this paper, we focus on LLM fine-tuning on a single consumer-grade GPU in a commodity server with limited main memory capacity, a setting accessible to most AI researchers. In this scenario, existing offloading-based methods fail to fine-tune an LLM efficiently because they lack holistic intra-server tensor movement management. To this end, we present LoHan, a low-cost, high-performance deep learning training framework that enables efficient 100B-scale model fine-tuning on a commodity server with a consumer-grade GPU and limited main memory capacity. The key idea is to add holistic offloading traffic as an optimization dimension, realized through 1) active gradient offloading and 2) a holistic traffic-aware activation swapping mechanism. The experimental results show that 1) LoHan is the first to fine-tune a 175B model on an RTX 4090 with 256 GB of main memory, 2) LoHan achieves 2.32× the throughput of state-of-the-art baselines when fine-tuning a smaller 13B model, and 3) LoHan enables a cheap, low-end consumer GPU to achieve higher cost-effectiveness than a DGX-A100 cluster when fine-tuning a 175B model.