How to Train a Model on a Cheap Cluster with Low Cost using Block Coordinate Descent

📅 2025-05-23
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge faced by small- and medium-sized teams—namely, insufficient computational resources to afford full-parameter pretraining of large language models (LLMs)—this paper proposes the first framework that deeply integrates Block Coordinate Descent (BCD) theory into full-parameter LLM training. Our method employs parameter block-wise optimization, memory-aware scheduling, mixed-precision computation, and cross-device weight consistency maintenance, ensuring hardware-agnostic convergence guarantees and enabling zero-modification migration across A100/A800 and RTX 4090 clusters. Experiments demonstrate that training a 7B model on an RTX 4090 cluster reduces cost to just 2.6% of conventional approaches, and to 33% on A100/A800 clusters—without any accuracy degradation. This work constitutes the first empirical validation of efficient, fidelity-preserving migration of high-end LLMs onto low-cost consumer-grade GPU clusters.
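The parameter block-wise optimization the summary refers to is the classical BCD iteration: at each step one block of parameters is selected and updated with its gradient while every other block stays frozen. A generic formulation (not the paper's exact block schedule) is:

```latex
\theta_{i_t}^{(t+1)} = \theta_{i_t}^{(t)} - \eta \,\nabla_{\theta_{i_t}} \mathcal{L}\left(\theta^{(t)}\right),
\qquad
\theta_{j}^{(t+1)} = \theta_{j}^{(t)} \quad \text{for } j \neq i_t,
```

where $i_t$ is the block active at step $t$ and $\eta$ is the learning rate. Because only the active block needs gradients and optimizer state in memory at once, the peak per-device memory footprint shrinks, which is what makes training feasible on smaller-memory GPUs such as the RTX 4090.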

📝 Abstract
Training large language models typically demands extensive GPU memory and substantial financial investment, which poses a barrier for many small- and medium-sized teams. In this paper, we present a full-parameter pre-training framework based on block coordinate descent (BCD), augmented with engineering optimizations, to efficiently train large models on affordable RTX 4090 GPU clusters. BCD guarantees model convergence via block coordinate descent theory and performs gradient computation and updates at the level of parameter blocks. Experiments show that: 1) Lower same-device cost: BCD significantly reduces pre-training cost. For a 7B model under identical hardware settings, BCD lowers training costs to approximately 33% of traditional full-parameter training on A100/A800 clusters and to approximately 2.6% on RTX 4090 clusters. 2) Cross-device transfer: With BCD, large-scale models previously trainable only on high-end A100 clusters can be seamlessly migrated and pre-trained on 4090 clusters, whose hourly cost is only about one quarter that of A100, without requiring expensive hardware. 3) Accuracy retention: In both scenarios, BCD training achieves the same level of model accuracy as full-parameter pre-training.
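The block-level gradient computation and update described in the abstract can be sketched on a toy problem. The snippet below is an illustrative least-squares example, not the paper's LLM training setup: parameters are split into two blocks, and each step computes gradients for and updates only the active block while the other stays frozen.

```python
import numpy as np

# Toy least-squares problem standing in for the training loss.
rng = np.random.default_rng(0)
A = rng.normal(size=(40, 8))
x_true = rng.normal(size=8)
b = A @ x_true

def loss(x):
    r = A @ x - b
    return 0.5 * r @ r

def block_grad(x, idx):
    # Gradient of the loss restricted to the coordinates in `idx`.
    # Only this block's gradient/optimizer state is needed at once,
    # mirroring how BCD lowers peak GPU memory during training.
    return A[:, idx].T @ (A @ x - b)

x = np.zeros(8)
blocks = [np.arange(0, 4), np.arange(4, 8)]  # two parameter blocks
lr = 0.01
for step in range(2000):
    idx = blocks[step % len(blocks)]   # cycle through the blocks
    x[idx] -= lr * block_grad(x, idx)  # update only the active block

# x now closely approximates x_true; the frozen-block updates still converge.
```

In the paper's setting the "blocks" are groups of LLM parameters (e.g. subsets of layers) and the schedule is combined with memory-aware scheduling and mixed precision, but the core update pattern is the same.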
Problem

Research questions and friction points this paper is trying to address.

Reducing GPU memory requirements for LLM training
Lowering financial costs of large model pre-training
Enabling efficient training on consumer-grade hardware
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses block coordinate descent for model training
Enables cost-effective training on RTX 4090 GPUs
Achieves comparable accuracy with lower GPU consumption
Zeyu Liu
School of Computer Science and Technology, North University of China
Yunquan Zhang
Professor, Institute of Computing Technology, CAS
Research interests: parallel computing, parallel programming, parallel computational model
Boyang Zhang
University of the Chinese Academy of Sciences
Guoyong Jiang
State Key Laboratory of Integrated Service Networks, Xidian University
Daning Cheng
Institute of Computing Technology, Chinese Academy of Sciences