🤖 AI Summary
This study addresses three challenges that industrial-scale large language models face during extended long-context pretraining: insufficient data, distorted evaluation metrics, and misjudged convergence. It does so by systematically analyzing the learning dynamics of the Hunyuan-A13B model over a 200B-token training trajectory, employing a three-tier analytical framework grounded in behavior, probability, and mechanism that integrates supervised fine-tuning probes, perplexity (PPL) evaluation, and attention-pattern tracking. The work reveals for the first time that industrial-scale models require over 150B tokens to reach performance saturation, introduces an intrinsic, PPL-based saturation criterion that avoids "deceptive convergence," and identifies retrieval attention heads as a low-overhead, highly reliable indicator of training progress whose scores correlate strongly with downstream task performance.
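The PPL-based intrinsic saturation criterion mentioned above can be sketched as a simple checkpoint-level check: declare saturation only when the relative perplexity improvement over a recent window of checkpoints drops below a small threshold. This is a minimal illustration; the function name, window size, and threshold below are assumed, not taken from the paper.

```python
def is_intrinsically_saturated(ppl_history, window=3, rel_threshold=0.005):
    """Sketch of a PPL-based saturation check (illustrative values).

    ppl_history: validation perplexity measured at successive checkpoints.
    Returns True when the relative PPL improvement across the last
    `window` checkpoint intervals falls below `rel_threshold`.
    """
    if len(ppl_history) < window + 1:
        # Too few checkpoints to judge a trend reliably.
        return False
    recent = ppl_history[-(window + 1):]
    # Relative improvement from the start to the end of the window.
    rel_improvement = (recent[0] - recent[-1]) / recent[0]
    return rel_improvement < rel_threshold


# Usage: PPL still falling steeply -> keep training.
print(is_intrinsically_saturated([10.0, 9.0, 8.0, 7.0]))          # False
# PPL curve has flattened -> intrinsic saturation.
print(is_intrinsically_saturated([10.0, 5.0, 4.99, 4.985, 4.984]))  # True
```

Unlike a benchmark score such as NIAH, which can plateau early ("deceptive saturation"), this kind of check tracks the model's continuing intrinsic improvement directly from held-out perplexity.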
📝 Abstract
Existing studies on Long-Context Continual Pre-training (LCCP) mainly focus on small-scale models and limited data regimes (tens of billions of tokens). We argue that directly migrating these small-scale settings to industrial-grade models risks insufficient adaptation and premature training termination. Furthermore, current evaluation methods rely heavily on downstream benchmarks (e.g., Needle-in-a-Haystack), which often fail to reflect the intrinsic convergence state and can lead to "deceptive saturation". In this paper, we present the first systematic investigation of LCCP learning dynamics using the industrial-grade Hunyuan-A13B (80B total parameters), tracking its evolution across a 200B-token training trajectory. Specifically, we propose a hierarchical framework to analyze LCCP dynamics across behavioral (supervised fine-tuning probing), probabilistic (perplexity), and mechanistic (attention patterns) levels. Our findings reveal: (1) Necessity of Massive Data Scaling: Training regimes of tens of billions of tokens are insufficient for the LCCP of industrial-grade LLMs (e.g., Hunyuan-A13B reaches saturation only after training on over 150B tokens). (2) Deceptive Saturation vs. Intrinsic Saturation: Traditional NIAH scores report "fake saturation" early, while our PPL-based analysis reveals continuous intrinsic improvements and correlates more strongly with downstream performance. (3) Mechanistic Monitoring for Training Stability: Retrieval heads act as efficient, low-resource training monitors, as their evolving attention scores reliably track LCCP progress and exhibit high correlation with SFT results. This work provides a comprehensive monitoring framework, evaluation system, and mechanistic interpretation for the LCCP of industrial-grade LLMs.
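The retrieval-head monitoring idea in finding (3) can be illustrated with a small sketch: score a head by the fraction of its attention mass (from the answer-generation position of a needle-retrieval probe) that lands on the needle tokens, then average over probes. This definition follows the common retrieval-head formulation in the interpretability literature in simplified form; the function names, spans, and weights below are hypothetical, not the paper's exact procedure.

```python
def retrieval_head_score(attn_row, needle_span):
    """Fraction of one head's attention mass that falls on the needle.

    attn_row: attention weights from a single query position (e.g., the
        answer token) over all context positions; assumed non-negative.
    needle_span: (start, end) indices of the needle tokens in the context.
    """
    start, end = needle_span
    total = sum(attn_row)
    if total == 0:
        return 0.0
    return sum(attn_row[start:end]) / total


def average_retrieval_score(probes):
    """Average the per-probe score to get one monitoring scalar per head.

    probes: list of (attn_row, needle_span) pairs from retrieval probes.
    """
    scores = [retrieval_head_score(row, span) for row, span in probes]
    return sum(scores) / len(scores)


# Usage with hypothetical attention weights: most mass on the needle.
probe = ([0.1, 0.6, 0.2, 0.1], (1, 3))
print(retrieval_head_score(*probe))  # 0.8 of the mass hits the needle
```

Because this scalar is computed from a forward pass on a handful of probes, it can be logged at every checkpoint at negligible cost, which is what makes retrieval heads attractive as a low-resource LCCP progress monitor.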