Parallel Scaling Law for Language Models

📅 2025-05-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Conventional model scaling relies on parameter growth, leading to substantial increases in GPU memory consumption and inference latency. Method: We propose ParScale, a parallel scaling paradigm that applies P diverse learnable transformations to the input, runs P forward passes of the model in parallel, and dynamically aggregates the P outputs during both training and inference—scaling computation without inflating parameters. Contribution: We establish, for the first time, a theoretical scaling law linking parallel scale to performance, showing that P-way parallelism is asymptotically equivalent to O(log P) parameter growth. ParScale can reuse an off-the-shelf pretrained model without modification, requiring only lightweight post-training for adaptation. Experiments demonstrate that, for the same performance improvement, ParScale incurs up to 22× less memory increase and 6× less latency increase than conventional parameter scaling—significantly lowering both training and deployment costs.
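The O(log P) equivalence can be read as an effective parameter count. One plausible way to write it (the functional form and the constant $k$ here are illustrative assumptions, not quantities quoted from the paper) is:

```latex
% Illustrative reading of the O(log P) equivalence: a model with N
% parameters and P parallel streams behaves like a larger model with
% N_eff parameters run with a single stream.
\[
  \mathcal{L}(N, P) \approx \mathcal{L}\bigl(N_{\mathrm{eff}},\, 1\bigr),
  \qquad
  N_{\mathrm{eff}} = N\,\bigl(k \log P + 1\bigr),
\]
```

where $k$ is a fitted constant measuring how much each additional parallel stream is "worth" in parameters; P = 1 recovers the base model exactly.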

📝 Abstract
It is commonly believed that scaling language models should commit a significant space or time cost, by increasing the parameters (parameter scaling) or output tokens (inference-time scaling). We introduce the third and more inference-efficient scaling paradigm: increasing the model's parallel computation during both training and inference time. We apply $P$ diverse and learnable transformations to the input, execute forward passes of the model in parallel, and dynamically aggregate the $P$ outputs. This method, namely parallel scaling (ParScale), scales parallel computation by reusing existing parameters and can be applied to any model structure, optimization procedure, data, or task. We theoretically propose a new scaling law and validate it through large-scale pre-training, which shows that a model with $P$ parallel streams is similar to scaling the parameters by $O(\log P)$ while showing superior inference efficiency. For example, ParScale can use up to 22$\times$ less memory increase and 6$\times$ less latency increase compared to parameter scaling that achieves the same performance improvement. It can also recycle an off-the-shelf pre-trained model into a parallelly scaled one by post-training on a small amount of tokens, further reducing the training budget. The new scaling law we discovered potentially facilitates the deployment of more powerful models in low-resource scenarios, and provides an alternative perspective for the role of computation in machine learning.
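The three-step recipe in the abstract—transform the input $P$ ways, run the shared model $P$ times, aggregate—can be sketched in a few lines. This is a toy illustration, not the paper's code: `base_model`, the additive prefixes, and the fixed aggregation logits are stand-ins for what the paper implements as a neural network, learned input transformations, and a learned aggregation head.

```python
# Minimal ParScale-style sketch (illustrative; all names are assumptions).
import math

def base_model(x):
    # Toy stand-in for a shared pretrained network: elementwise tanh.
    return [math.tanh(v) for v in x]

def parscale_forward(x, prefixes, agg_logits):
    """Run P parallel streams through the SAME model and aggregate.

    prefixes   : P input transformations (here: additive offsets).
    agg_logits : P scores turned into softmax weights for aggregation.
    """
    # 1) P diverse input transformations (learnable in the paper).
    streams = [[xi + p for xi in x] for p in prefixes]
    # 2) P forward passes that reuse the same parameters.
    outputs = [base_model(s) for s in streams]
    # 3) Dynamic aggregation: softmax-weighted sum of the P outputs.
    exps = [math.exp(l) for l in agg_logits]
    z = sum(exps)
    weights = [e / z for e in exps]
    return [sum(w * out[d] for w, out in zip(weights, outputs))
            for d in range(len(x))]
```

With P = 1 and a zero prefix this reduces exactly to the base model, which is why pretrained weights can be recycled; scaling P only multiplies compute, not parameters.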
Problem

Research questions and friction points this paper is trying to address.

Parameter scaling and inference-time scaling both impose significant memory or latency costs
Whether increasing parallel computation can serve as a third, more inference-efficient scaling axis
How model performance scales with the number of parallel streams
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces parallel scaling for efficient inference
Dynamically aggregates diverse parallel model outputs
Reuses parameters to reduce memory and latency