PolicyLong: Towards On-Policy Context Extension

πŸ“… 2026-04-09
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the challenge of long-context extension in large language models, which is hindered by the scarcity of high-quality data exhibiting long-range dependencies and the misalignment between static offline data construction and dynamic model training. To overcome this, the authors propose PolicyLong, a novel framework that shifts data curation from an offline static paradigm to an online dynamic strategy. By iteratively leveraging the current model’s entropy estimates, retrieval-augmented verification, and context filtering, PolicyLong dynamically generates positive and hard negative samples that co-evolve with the model’s capabilities, thereby forming a self-generated curriculum. Evaluated on RULER, HELMET, and LongBench-v2 (with Qwen2.5-3B), PolicyLong substantially outperforms prior methods such as EntropyLong and NExtLong, achieving a +2.54 improvement in RULER score at 128K context length.
πŸ“ Abstract
Extending LLM context windows is hindered by scarce high-quality long-context data. Recent methods synthesize data with genuine long-range dependencies via information-theoretic verification, selecting contexts that reduce a base model's predictive entropy. However, their single-pass offline construction with a fixed model creates a fundamental off-policy gap: the static screening landscape misaligns with the model's evolving capabilities, causing the training distribution to drift. We propose PolicyLong, shifting data construction towards a dynamic on-policy paradigm. By iteratively re-executing data screening (entropy computation, retrieval, and verification) using the current model, PolicyLong ensures the training distribution tracks evolving capabilities, yielding an emergent self-curriculum. Crucially, both positive and hard negative contexts derive from the current model's entropy landscape, co-evolving what the model learns to exploit and resist. Experiments on RULER, HELMET, and LongBench-v2 (Qwen2.5-3B) show PolicyLong consistently outperforms EntropyLong and NExtLong, with gains growing at longer contexts (e.g., +2.54 at 128K on RULER), confirming the value of on-policy data evolution.
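The abstract's core loop (re-running entropy scoring over candidate contexts with the *current* model after each training round, so positives and hard negatives co-evolve with capability) can be sketched as a toy. All names here (`entropy_reduction`, `screen_round`, the scalar `skill` stand-in for model capability) are invented for illustration; the paper gives no reference implementation.

```python
def entropy_reduction(model_state, context):
    # Stand-in for the information-theoretic score the abstract describes:
    # how much this context reduces the current model's predictive entropy.
    # Here a toy: easy contexts stop helping as the model's skill grows.
    return context["relevance"] - model_state["skill"] * context["easiness"]

def screen_round(model_state, candidates, k_pos=2, k_neg=2):
    # Re-score ALL candidates under the current model (the on-policy step).
    scored = sorted(candidates,
                    key=lambda c: entropy_reduction(model_state, c),
                    reverse=True)
    positives = scored[:k_pos]        # contexts the model still benefits from
    hard_negatives = scored[-k_neg:]  # distractors it should learn to resist
    return positives, hard_negatives

def policylong_loop(candidates, rounds=3):
    # Toy model state; a real run would hold actual model weights.
    model_state = {"skill": 0.0}
    curriculum = []
    for _ in range(rounds):
        pos, neg = screen_round(model_state, candidates)
        curriculum.append((pos, neg))
        # Training on (pos, neg) changes the model, so the next round's
        # screening sees a different entropy landscape: a self-curriculum.
        model_state["skill"] += 0.5
    return curriculum
```

In this toy, a context that scores as a positive early (high relevance but easy) drifts into the hard-negative pool once the model improves, which is exactly the off-policy gap a single-pass offline screen would miss.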
Problem

Research questions and friction points this paper is trying to address.

context extension
on-policy learning
data distribution drift
long-context modeling
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

on-policy learning
context extension
self-curriculum
entropy-based screening
long-context LLMs
πŸ”Ž Similar Papers
Junlong Jia (Hunyuan Team, Tencent; Chinese Academy of Sciences)
Ziyang Chen (Peking University) · Quantum key distribution, Quantum random number generation
Xing Wu (Hunyuan Team, Tencent; Chinese Academy of Sciences)
Chaochen Gao (Institute of Information Engineering, Chinese Academy of Sciences) · NLP, Long-Context LLM
TingHao Yu (Hunyuan Team, Tencent; Chinese Academy of Sciences)
Feng Zhang (Hunyuan Team, Tencent; Chinese Academy of Sciences)
Songlin Hu (Hunyuan Team, Tencent; Chinese Academy of Sciences)