Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum

📅 2024-05-21
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
To address the high resource consumption and inefficiency of long-sequence training for large language models (LLMs), this paper proposes dataset decomposition, a variable sequence length training technique with a length-based curriculum grounded in actual document lengths. Methodologically: (1) each document is decomposed into chunks that are grouped into buckets, so every sequence in a bucket has the same length and comes from a single document; (2) a length-aware batch-size schedule and a cross-bucket curriculum sample simultaneously from all buckets during training; (3) attention is masked across document boundaries, so the attention cost at each step is proportional to actual document lengths rather than a fixed worst case. The paper also highlights a critical, under-studied aspect of pretraining: the distribution and curriculum of sequence lengths, establishing a variable-length curriculum based on real document lengths. Experiments show that, at the same cost as a 2K-context baseline, the method trains an 8K-context 1B model, improves long-text task performance by 4.2% on average, reaches target accuracy up to 6x faster than the baseline, and scales effectively with dataset size.
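The "length-aware dynamic batch-size scheduler" described above can be illustrated with a minimal sketch: hold the per-step token budget fixed, so shorter sequences get proportionally larger batches. The function name and the 65,536-token budget are illustrative assumptions, not the paper's implementation.

```python
def batch_size_for(seq_len: int, tokens_per_step: int = 65536) -> int:
    """Batch size such that seq_len * batch_size == tokens_per_step.

    NOTE: illustrative sketch; the fixed token budget is an assumed
    hyperparameter, not taken from the paper's code.
    """
    assert tokens_per_step % seq_len == 0, "budget must divide evenly"
    return tokens_per_step // seq_len


# With a constant token budget, every optimizer step processes the same
# number of tokens regardless of which length bucket it draws from:
# 2K-token sequences -> batches of 32, 8K-token sequences -> batches of 8.
```

Because attention cost grows quadratically with sequence length, steps drawn from short buckets are strictly cheaper under this schedule, which is where the training-time savings come from.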

📝 Abstract
Large language models (LLMs) are commonly trained on datasets consisting of fixed-length token sequences. These datasets are created by randomly concatenating documents of various lengths and then chunking them into sequences of a predetermined target length (concat-and-chunk). Recent attention implementations mask cross-document attention, reducing the effective length of a chunk of tokens. Additionally, training on long sequences becomes computationally prohibitive due to the quadratic cost of attention. In this study, we introduce dataset decomposition, a novel variable sequence length training technique, to tackle these challenges. We decompose a dataset into a union of buckets, each containing sequences of the same size extracted from a unique document. During training, we use variable sequence length and batch-size, sampling simultaneously from all buckets with a curriculum. In contrast to the concat-and-chunk baseline, which incurs a fixed attention cost at every step of training, our proposed method incurs a computational cost proportional to the actual document lengths at each step, resulting in significant savings in training time. We train an 8k context-length 1B model at the same cost as a 2k context-length model trained with the baseline approach. Experiments on a web-scale corpus demonstrate that our approach significantly enhances performance on standard language evaluations and long-context benchmarks, reaching target accuracy with up to 6x faster training compared to the baseline. Our method not only enables efficient pretraining on long sequences but also scales effectively with dataset size. Lastly, we shed light on a critical yet less studied aspect of training large language models: the distribution and curriculum of sequence lengths, which results in a non-negligible difference in performance.
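The decomposition step in the abstract can be sketched as follows: split each document into power-of-two-length chunks (an assumption for illustration; the paper describes buckets of same-size sequences, each extracted from a unique document) and group chunks by length. All helper names here are hypothetical.

```python
from collections import defaultdict

def decompose(doc_tokens, max_len=8192):
    """Split one document into power-of-two-length chunks, largest first.

    Illustrative sketch: the binary split is an assumed bucketing scheme,
    chosen so that every chunk length is a valid bucket size.
    """
    chunks, pos, remaining = [], 0, len(doc_tokens)
    size = max_len
    while remaining > 0 and size >= 1:
        if remaining >= size:
            chunks.append(doc_tokens[pos:pos + size])
            pos += size
            remaining -= size
        else:
            size //= 2  # fall back to the next smaller bucket size
    return chunks

def build_buckets(documents, max_len=8192):
    """Group chunks by length: every sequence in a bucket has one size and
    comes from a single document, so no attention mask spans documents."""
    buckets = defaultdict(list)
    for doc in documents:
        for chunk in decompose(doc, max_len):
            buckets[len(chunk)].append(chunk)
    return buckets
```

During training, batches would then be drawn from these buckets under a curriculum over lengths, in contrast to concat-and-chunk, which mixes unrelated documents inside one fixed-length sequence.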
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Long Sequence Training
Computational Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dataset Decomposition
Dynamic Adjustment
Long Sequence Training Efficiency