🤖 AI Summary
To address the prohibitively long training time and high resource consumption of large language models (LLMs), this paper proposes PaPaformer—a decoder-only Transformer architecture based on parallel low-dimensional pathways. Its core innovation lies in decomposing the model into multiple independent sub-pathways, each trained in parallel on distinct data subsets, followed by a dynamic parameter merging strategy to reconstruct the full model. This design enables task-specific pathway construction, flexible model scaling, and efficient distributed training. Experiments demonstrate that PaPaformer achieves full-model training within hours—accelerating training by over an order of magnitude compared to conventional approaches—while substantially reducing GPU memory requirements and total parameter count. Crucially, it maintains or even improves performance across downstream tasks. This work validates the feasibility of hour-scale efficient LLM training and lightweight, customizable deployment.
📝 Abstract
The training of modern large-language models requires an increasing amount of computation power and time. Even smaller variants, such as small language models (SLMs), take several days to train in the best-case scenarios, often requiring multiple GPUs. This paper explores methods to train and evaluate decoder-only transformer-based language models in hours instead of days or weeks. We introduce *PaPaformer*, a decoder-only transformer architecture variant whose lower-dimensional parallel paths are combined into a larger model. The paper shows that these lower-dimensional paths can be trained individually on different types of training data and then combined into one larger model. This method offers the option to reduce the total number of model parameters and the training time while increasing performance. Moreover, the parallel path structure opens interesting possibilities to customize paths to accommodate specific task requirements.
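The core idea of splitting a model's width into independently trainable low-dimensional paths can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the slicing scheme, the per-path weights, and the concatenation-based merge are all assumptions made for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: split the model width d_model into n_paths
# independent low-dimensional paths of width d_path each.
d_model, n_paths = 64, 4
d_path = d_model // n_paths

# One weight matrix per path; in the paper's scheme each path could be
# trained in parallel on its own data subset before being merged.
path_weights = [rng.standard_normal((d_path, d_path)) * 0.1
                for _ in range(n_paths)]

def parallel_paths(x):
    """Apply each low-dimensional path to its slice of the input,
    then concatenate the path outputs back to the full model width."""
    slices = np.split(x, n_paths, axis=-1)                # n_paths slices of width d_path
    outs = [s @ W for s, W in zip(slices, path_weights)]  # independent path computations
    return np.concatenate(outs, axis=-1)                  # merge back to d_model

x = rng.standard_normal((2, d_model))  # batch of 2 token vectors
y = parallel_paths(x)
print(y.shape)  # (2, 64)
```

Because each path only touches a `d_path`-wide slice, its parameter count and memory footprint shrink quadratically relative to a full-width layer, which is consistent with the training-time and memory savings the paper reports.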