Efficient Pretraining Length Scaling

📅 2025-04-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the inefficiency of context-length extension and the excessive KV-cache overhead it incurs during large language model pretraining, this paper proposes the PHD-Transformer. Methodologically, it introduces hidden decoding tokens whose KV entries are evicted immediately after use, while the KV cache of original tokens is retained for long-range dependencies, keeping cache memory at the level of a standard Transformer. It further designs two variants: PHD-SWA applies sliding-window attention to strengthen local dependency modeling, and PHD-CSWA applies chunk-wise sliding-window attention to eliminate linear growth in prefill time. Empirically, the PHD-Transformer achieves consistent performance gains across multiple benchmarks without increasing the cache-memory footprint, advancing the efficiency frontier of long-context pretraining.

📝 Abstract
Recent advances in large language models have demonstrated the effectiveness of length scaling during post-training, yet its potential in pre-training remains underexplored. We present the Parallel Hidden Decoding Transformer (*PHD*-Transformer), a novel framework that enables efficient length scaling during pre-training while maintaining inference efficiency. *PHD*-Transformer achieves this through an innovative KV cache management strategy that distinguishes between original tokens and hidden decoding tokens. By retaining only the KV cache of original tokens for long-range dependencies while immediately discarding hidden decoding tokens after use, our approach maintains the same KV cache size as the vanilla transformer while enabling effective length scaling. To further enhance performance, we introduce two optimized variants: *PHD-SWA* employs sliding window attention to preserve local dependencies, while *PHD-CSWA* implements chunk-wise sliding window attention to eliminate linear growth in pre-filling time. Extensive experiments demonstrate consistent improvements across multiple benchmarks.
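The cache-management strategy in the abstract can be sketched as follows. This is an illustrative toy implementation, not the paper's code: the class name `PHDKVCache` and its methods are hypothetical, and the KV entries are stand-in placeholders rather than real tensors.

```python
# Hedged sketch of the KV-cache policy described above: KV entries of
# original tokens are retained for long-range attention, while entries
# of hidden decoding tokens are used for the current step and evicted
# immediately, so steady-state cache size matches a vanilla transformer.
class PHDKVCache:
    def __init__(self):
        self.keys = []    # retained K entries (original tokens only)
        self.values = []  # retained V entries (original tokens only)

    def step(self, k, v, is_hidden):
        """Return the (keys, values) context the current token attends to.

        The current step's (k, v) is always visible to itself, but it is
        only persisted in the cache when the token is an original token.
        """
        ctx = (self.keys + [k], self.values + [v])
        if not is_hidden:            # hidden decoding tokens are evicted
            self.keys.append(k)
            self.values.append(v)
        return ctx
```

Under this policy the cache grows only with the number of original tokens, which is what lets the method claim vanilla-transformer cache memory despite the extra decoding steps.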
Problem

Research questions and friction points this paper is trying to address.

Length scaling is effective during post-training, but its potential in pre-training remains underexplored
Naively extending context length during pre-training inflates the KV cache and degrades inference efficiency
KV cache management must be optimized so that length scaling keeps vanilla-transformer memory costs
Innovation

Methods, ideas, or system contributions that make the work stand out.

PHD-Transformer enables efficient length scaling during pre-training
KV cache management retains original tokens and immediately evicts hidden decoding tokens
PHD-SWA preserves local dependencies via sliding-window attention; PHD-CSWA eliminates linear growth in pre-filling time via chunk-wise windows
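The chunk-wise sliding-window idea behind PHD-CSWA can be illustrated with a toy attention mask. This is a plausible reading of "chunk-wise sliding window attention," not the paper's exact masking rule: the function name and the specific combination of constraints are assumptions.

```python
def cswa_mask(seq_len, window, chunk):
    """Toy causal mask combining a sliding window with chunk boundaries,
    in the spirit of PHD-CSWA (illustrative; the paper's exact rule may
    differ). mask[i][j] is True iff query i may attend to key j.

    Confining the window within each chunk lets chunks be pre-filled
    independently, which is what removes the linear growth in prefill time.
    """
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            causal = j <= i                      # no attending to the future
            in_window = i - j < window           # sliding-window constraint
            same_chunk = i // chunk == j // chunk  # chunk-local attention
            mask[i][j] = causal and in_window and same_chunk
    return mask
```

For example, with `window=3` and `chunk=4`, token 5 can attend to token 4 but not to token 3, since token 3 lies in the previous chunk.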