🤖 AI Summary
This work addresses the high memory cost of full-parameter fine-tuning, which puts it out of reach of consumer-grade GPUs, by proposing ChunkFT, a framework that performs full-parameter fine-tuning without any modification to the model architecture. ChunkFT reformulates optimization around a dynamically activated working set of parameter chunks and supports gradient computation for arbitrary sub-tensors, so that arbitrary sub-networks can be optimized without a standard dense gradient computation. It further incorporates memory-efficient optimizer state management and remains compatible with standard backpropagation, and the authors provide a convergence analysis in the deterministic setting. Experiments fine-tune Llama 3-8B on a single 24 GB RTX 4090 and Llama 3-70B on two 80 GB H800 GPUs; full-parameter fine-tuning of a 7B model with a 1K input length requires only 13.72 GB of GPU memory. On language understanding, mathematical reasoning, and MT-Bench, ChunkFT consistently outperforms existing memory-efficient baselines and achieves performance comparable to, and in some cases exceeding, conventional full-parameter fine-tuning.
📝 Abstract
This work presents \textsc{ChunkFT}, a memory-efficient fine-tuning framework that reformulates full-parameter fine-tuning around a dynamically activated working set. \textsc{ChunkFT} enables gradient computation for arbitrary sub-tensors without modifying the network architecture, providing an algorithmic foundation for optimizing arbitrary sub-networks while avoiding standard dense gradient computation. We provide a theoretical convergence analysis of \textsc{ChunkFT} in the deterministic setting. Empirically, we apply \textsc{ChunkFT} to fine-tune Llama 3-8B and Llama 3-70B using a single RTX 4090-24GB GPU and 2$\times$ H800-80GB GPUs, respectively. Full-parameter fine-tuning of a 7B model with a 1K input length requires only 13.72GB of GPU memory. The results demonstrate the effectiveness of \textsc{ChunkFT} in terms of memory usage, running time, and optimization quality. Moreover, downstream evaluations on language understanding, mathematical reasoning, and MT-Bench show that \textsc{ChunkFT} consistently outperforms existing memory-efficient baselines. Notably, \textsc{ChunkFT} achieves performance comparable to, and in some cases exceeding, full-parameter fine-tuning. Our code is available at https://github.com/misonsky/chunk.
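The working-set idea in the abstract can be illustrated on a toy problem. The sketch below is not ChunkFT itself: it is plain cyclic block gradient descent on a small least-squares objective, in which the forward pass uses every parameter while each step computes the gradient only for the currently active chunk. The function names, the fixed chunk size, the cyclic schedule, the step size, and the 4x4 problem are all assumptions made for illustration.

```python
# Toy illustration only, NOT the ChunkFT algorithm: cyclic chunk-wise gradient
# descent on f(w) = 0.5 * ||A w - b||^2. All names and settings are hypothetical.

def residual(A, w, b):
    """Forward pass: uses every parameter, r = A w - b."""
    return [sum(a_ij * w_j for a_ij, w_j in zip(row, w)) - b_i
            for row, b_i in zip(A, b)]


def loss(A, w, b):
    """Objective value 0.5 * ||A w - b||^2."""
    return 0.5 * sum(r_i ** 2 for r_i in residual(A, w, b))


def chunk_gradient(A, w, b, chunk):
    """Gradient of the loss with respect to the indices in `chunk` only."""
    r = residual(A, w, b)
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in chunk]


def chunked_descent(A, b, w, chunk_size, lr, sweeps):
    """Cycle through fixed-size chunks, updating one working set per step."""
    n = len(w)
    chunks = [range(s, min(s + chunk_size, n)) for s in range(0, n, chunk_size)]
    history = [loss(A, w, b)]
    for _ in range(sweeps):
        for chunk in chunks:  # activate one working set at a time
            g = chunk_gradient(A, w, b, chunk)
            for j, g_j in zip(chunk, g):
                w[j] -= lr * g_j  # parameters outside the chunk are untouched
            history.append(loss(A, w, b))
    return w, history


# Small, well-conditioned system chosen so that lr = 0.1 is a safe step size.
A = [[2.0, 0.5, 0.0, 0.0],
     [0.5, 2.0, 0.5, 0.0],
     [0.0, 0.5, 2.0, 0.5],
     [0.0, 0.0, 0.5, 2.0]]
b = [1.0, 2.0, 3.0, 4.0]
w, history = chunked_descent(A, b, [0.0] * 4, chunk_size=2, lr=0.1, sweeps=300)
print(f"initial loss = {history[0]:.3f}, final loss = {history[-1]:.3e}")
```

In this deterministic toy, only the gradient entries for the active chunk are ever materialized, and a sufficiently small step size makes the loss non-increasing at every chunk update. How ChunkFT selects, schedules, and differentiates its working set inside a real network is specified in the paper and repository, not by this sketch.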