How to Train Long-Context Language Models (Effectively)

📅 2024-10-03
🏛️ arXiv.org
📈 Citations: 29
Influential: 4
🤖 AI Summary
Training and evaluating large language models (LLMs) for long-context understanding and generation remains challenging due to high computational cost and the lack of reliable evaluation protocols. Method: The authors establish a downstream-task-based evaluation protocol (in place of perplexity or needle-in-a-haystack tests), evaluate models after SFT, and systematically study continued pretraining, supervised fine-tuning (SFT), and position extrapolation using a multi-source long-text mixture (code repositories, books, high-quality short texts). Contribution/Results: They find that training on sequences longer than the evaluation length substantially improves long-context performance, and that short-instruction SFT alone suffices for strong long-context results. The released ProLong-8B, initialized from Llama-3 and trained on 40B tokens, achieves state-of-the-art (SOTA) results at 128K context length among similarly sized models, outperforms Llama-3.1-8B-Instruct while using only 5% as many long-context training tokens, and can effectively process up to 512K tokens.

📝 Abstract
We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information. We first establish a reliable evaluation protocol to guide model development -- instead of perplexity or simple needle-in-a-haystack (NIAH) tests, we use a broad set of long-context downstream tasks, and we evaluate models after SFT as this better reveals long-context abilities. Supported by our robust evaluations, we run thorough experiments to decide the data mix for continued pre-training, the instruction tuning dataset, and many other design choices such as position extrapolation. We find that (1) code repositories and books are excellent sources of long data, but it is crucial to combine them with high-quality short-context data; (2) training with a sequence length beyond the evaluation length boosts long-context performance; (3) for SFT, using only short instruction datasets yields strong performance on long-context tasks. Our final model, ProLong-8B, which is initialized from Llama-3 and trained on 40B tokens, demonstrates state-of-the-art long-context performance among similarly sized models at a length of 128K. ProLong outperforms Llama-3.1-8B-Instruct on the majority of long-context tasks despite using only 5% as many tokens during long-context training. Additionally, ProLong can effectively process up to 512K tokens, one of the longest context windows of publicly available LMs.
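The abstract mentions position extrapolation as one of the design choices studied. A common way to extend a RoPE-based model's context window is positional interpolation: rescaling position indices so that a longer context maps back onto the angle range seen during pretraining. The sketch below is a minimal illustration of that idea in plain Python; the function names and the scale factor are illustrative, not the paper's implementation.

```python
import math

def rope_frequencies(dim, base=10000.0):
    # Per-pair rotation frequencies used by rotary position embeddings (RoPE).
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

def rope_angle(pos, freq, scale=1.0):
    # Positional interpolation: divide the position index by `scale`,
    # so a context window `scale`x longer reuses the angle range the
    # base model was trained on.
    return (pos / scale) * freq

# With scale=4, position 4096 lands on the same rotation angle the
# base model saw at position 1024, keeping angles in-distribution.
freqs = rope_frequencies(64)
assert math.isclose(rope_angle(4096, freqs[0], scale=4.0),
                    rope_angle(1024, freqs[0], scale=1.0))
```

Interpolation trades some positional resolution for staying within the trained angle range, whereas plain extrapolation feeds the model angles it has never seen.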
Problem

Research questions and friction points this paper is trying to address.

Choosing an effective data mixture for long-context continued pretraining
Determining whether training on sequences longer than the evaluation length helps
Designing SFT data that preserves long-context abilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines long and short-context data effectively
Trains beyond evaluation length for better performance
Uses short instruction datasets for SFT