Beyond Next Token Prediction: Patch-Level Training for Large Language Models

📅 2024-07-17
🏛️ International Conference on Learning Representations
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address the prohibitively high pretraining cost of large language models (LLMs), which impedes rapid model iteration, this paper proposes patch-level training: a pretraining scheme that replaces single-token units with semantically denser, multi-token "patches" as the fundamental modeling unit, introducing patch-level autoregressive modeling to LLM pretraining. The method has three key components: (1) aggregation of consecutive tokens into higher-information-density patches; (2) a two-stage training strategy, with efficient patch-level prediction on the majority of the data followed by token-level training on the remainder; and (3) alignment with the standard token-level inference mode, so the final model stays compatible with ordinary token-based architectures. Evaluated across model sizes from 370M to 2.7B parameters, patch-level training cuts overall pretraining cost to 0.5× while matching the performance of token-level baselines on mainstream benchmarks.

📝 Abstract
The prohibitive training costs of Large Language Models (LLMs) have emerged as a significant bottleneck in the development of next-generation LLMs. In this paper, we show that it is possible to significantly reduce the training costs of LLMs without sacrificing their performance. Specifically, we introduce patch-level training for LLMs, in which multiple tokens are aggregated into a unit of higher information density, referred to as a "patch", to serve as the fundamental text unit for training LLMs. During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch, thereby processing the majority of the training data at a significantly reduced cost. Following this, the model continues token-level training on the remaining training data to align with the inference mode. Experiments on a diverse range of models (370M-2.7B parameters) demonstrate that patch-level training can reduce the overall training costs to 0.5×, without compromising the model performance compared to token-level training. Source code: https://github.com/shaochenze/PatchTrain.
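As a rough illustration of the mechanism the abstract describes, the sketch below groups consecutive tokens into patches and builds patch-level prediction targets. The patch size, the toy token sequence, and the embedding-averaging rule are illustrative assumptions, not the paper's actual implementation (which is at the linked repository).

```python
# Toy sketch of patch-level data preparation (illustrative assumptions only).

K = 4  # patch size: number of tokens aggregated per patch (hypothetical choice)

def tokens_to_patches(token_ids, k=K):
    """Group a token sequence into consecutive patches of k tokens each.

    Any trailing remainder shorter than k is dropped so every patch is complete.
    """
    n = len(token_ids) - len(token_ids) % k
    return [token_ids[i:i + k] for i in range(0, n, k)]

def patch_embedding(patch, embed):
    """Average the embeddings of a patch's tokens into one dense vector,
    giving the model a sequence that is k times shorter than the token stream."""
    dim = len(next(iter(embed.values())))
    return [sum(embed[t][d] for t in patch) / len(patch) for d in range(dim)]

tokens = [3, 7, 7, 1, 9, 2, 4, 4, 5]
patches = tokens_to_patches(tokens)  # [[3, 7, 7, 1], [9, 2, 4, 4]]

# Patch-level training target: from patch i, predict all k tokens of patch i+1
# (the trailing token 5 was dropped as an incomplete patch).
inputs, targets = patches[:-1], patches[1:]
```

After this patch-level phase, the model would continue ordinary next-token training on the remaining data so that its inference mode matches a standard token-level LLM.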
Problem

Research questions and friction points this paper is trying to address.

Reducing LLM pretraining costs without sacrificing performance
Raising the information density of the modeling unit via multi-token patches
Cutting overall training cost to 0.5× while matching token-level performance
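A back-of-the-envelope cost model makes the 0.5× figure concrete. Assuming (as a simplification) that a patch-level sequence of k-token patches costs roughly 1/k of the compute of processing the same tokens individually, training a fraction `lam` of the data at patch level gives the relative cost below; the specific values k = 4 and lam = 2/3 are illustrative.

```python
def relative_cost(lam, k):
    """Approximate training cost relative to pure token-level training,
    assuming patch-level sequences are k times shorter and therefore
    roughly 1/k as expensive (a simplifying assumption)."""
    return lam / k + (1 - lam)

# Example: patch size k = 4 applied to two thirds of the training data
# gives (2/3)/4 + 1/3, i.e. about half the cost.
cost = relative_cost(2 / 3, 4)  # ~0.5
```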
Innovation

Methods, ideas, or system contributions that make the work stand out.

Patch-level training reduces LLM costs
Aggregates tokens into dense patches
Combines patch and token-level training