🤖 AI Summary
This work proposes an unsupervised reinforcement learning paradigm to enhance long-context comprehension in large language models without relying on human annotations or external teacher models. The approach trains a model to reconstruct an original document by identifying and correctly ordering paragraphs that have been removed from a long text. A reward based solely on paragraph-ordering accuracy drives the learning process, eliminating the need for supervised signals. The study systematically investigates data construction strategies, training protocols, and scaling effects. Evaluated on the RULER and LongBench v2 benchmarks, the method significantly outperforms existing approaches, achieving notable gains on RULER without using any human-crafted long-context question-answering data.
📝 Abstract
Reinforcement Learning with Verifiable Rewards~(RLVR) has become a prominent paradigm for enhancing the capabilities (e.g.\ long-context comprehension) of Large Language Models~(LLMs). However, it often relies on gold-standard answers or explicit evaluation rubrics provided by powerful teacher models or human experts, which are costly and time-consuming to obtain. In this work, we investigate unsupervised approaches to enhancing the long-context capabilities of LLMs, eliminating the need for heavy human annotation or teacher-model supervision. Specifically, we first replace a few paragraphs in a long document with special placeholders. LLMs are then trained through reinforcement learning to reconstruct the document by correctly identifying and sequencing the missing paragraphs from a set of candidate options. This training paradigm forces the model to capture global narrative coherence, significantly boosting long-context performance. We validate the effectiveness of our method on two widely used benchmarks, RULER and LongBench~v2. Beyond noticeable gains on RULER, our method also achieves a reasonable improvement on LongBench~v2 without any manually curated long-context QA data. Furthermore, we conduct extensive ablation studies to analyze the impact of reward design, data curation strategies, training schemes, and data scaling on model performance. We publicly release our code, data, and models.
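To make the training setup concrete, here is a minimal sketch of the two pieces the abstract describes: constructing a masked document with shuffled candidate paragraphs, and scoring a model's proposed ordering. This is an illustration under stated assumptions, not the paper's implementation; the function names (`build_task`, `ordering_reward`), the placeholder format, and the exact reward (fraction of correctly filled slots) are hypothetical choices for exposition.

```python
import random

# Hypothetical placeholder format; the paper's exact token is not specified here.
PLACEHOLDER = "<missing_paragraph_{i}>"

def build_task(document, num_removed=3, seed=0):
    """Remove a few paragraphs from a document, replacing each with a
    numbered placeholder. Returns the masked document, the shuffled
    candidate paragraphs, and the gold ordering (for each placeholder
    slot, the index of the correct candidate)."""
    rng = random.Random(seed)
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    removed_idx = sorted(rng.sample(range(len(paragraphs)), num_removed))
    masked = list(paragraphs)
    for slot, idx in enumerate(removed_idx):
        masked[idx] = PLACEHOLDER.format(i=slot)
    candidates = [paragraphs[i] for i in removed_idx]
    order = list(range(num_removed))
    rng.shuffle(order)  # shuffle the candidates shown to the model
    shuffled = [candidates[j] for j in order]
    # gold[slot] = position in `shuffled` of the paragraph that fills that slot
    gold = [order.index(slot) for slot in range(num_removed)]
    return "\n\n".join(masked), shuffled, gold

def ordering_reward(prediction, gold):
    """Verifiable reward: fraction of placeholder slots assigned the
    correct candidate. No labels or teacher model are needed, since the
    ground truth comes from the document itself."""
    correct = sum(p == g for p, g in zip(prediction, gold))
    return correct / len(gold)
```

Because the reward is computed purely from the document's original paragraph order, any long document can be turned into a training instance, which is what removes the dependence on curated QA data.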