🤖 AI Summary
To address the prohibitively long training time and high resource consumption of large language models (LLMs), this paper proposes PaPaformer—a decoder-only Transformer architecture based on parallel low-dimensional pathways. Its core innovation lies in decomposing the model into multiple independent sub-pathways, each trained in parallel on distinct data subsets, followed by a dynamic parameter merging strategy to reconstruct the full model. This design enables task-specific pathway construction, flexible model scaling, and efficient distributed training. Experiments demonstrate that PaPaformer achieves full-model training within hours—accelerating training by over an order of magnitude compared to conventional approaches—while substantially reducing GPU memory requirements and total parameter count. Crucially, it maintains or even improves performance across downstream tasks. This work validates the feasibility of hour-scale efficient LLM training and lightweight, customizable deployment.
📝 Abstract
The training of modern large-language models requires an increasing amount of computation power and time. Even smaller variants, such as small language models (SLMs), take several days to train in the best-case scenarios, often requiring multiple GPUs. This paper explores methods to train and evaluate decoder-only transformer-based language models in hours instead of days or weeks. We introduce *PaPaformer*, a decoder-only transformer architecture variant whose lower-dimensional parallel paths are combined into a larger model. The paper shows that these lower-dimensional paths can be trained individually on different types of training data and then combined into one larger model. This method offers the option to reduce the total number of model parameters and the training time while increasing performance. Moreover, the parallel path structure opens interesting possibilities to customize paths to accommodate specific task requirements.
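The core idea of splitting a model's width into independently trainable low-dimensional paths can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the slicing scheme, the per-path weights, and the concatenation-based merge are all assumptions made for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: split the model width d_model into n_paths
# independent low-dimensional paths of width d_path each.
d_model, n_paths = 64, 4
d_path = d_model // n_paths

# One weight matrix per path; in the paper's scheme each path could be
# trained in parallel on its own data subset before being merged.
path_weights = [rng.standard_normal((d_path, d_path)) * 0.1
                for _ in range(n_paths)]

def parallel_paths(x):
    """Apply each low-dimensional path to its slice of the input,
    then concatenate the path outputs back to the full model width."""
    slices = np.split(x, n_paths, axis=-1)                # n_paths slices of width d_path
    outs = [s @ W for s, W in zip(slices, path_weights)]  # independent path computations
    return np.concatenate(outs, axis=-1)                  # merge back to d_model

x = rng.standard_normal((2, d_model))  # batch of 2 token vectors
y = parallel_paths(x)
print(y.shape)  # (2, 64)
```

Because each path only touches a `d_path`-wide slice, its parameter count and memory footprint shrink quadratically relative to a full-width layer, which is consistent with the training-time and memory savings the paper reports.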