🤖 AI Summary
This work addresses the limited planning and execution capabilities of large language models on extremely long-horizon tasks by proposing the KLong framework. KLong cold-starts the model with trajectory-splitting supervised fine-tuning (SFT), which preserves early context, progressively truncates later context, and maintains overlap between sub-trajectories, and then scales it with a multi-stage progressive reinforcement learning (RL) strategy that incrementally extends task timeouts and complexity. The framework further integrates Research-Factory, an automated data-generation pipeline, with trajectory distillation from Claude 4.5 Sonnet. Experimental results show that KLong (106B) outperforms Kimi K2 Thinking (1T) by 11.28% on PaperBench, and that the gains generalize to coding benchmarks such as SWE-bench Verified and MLE-bench.
📝 Abstract
This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks. The principle is to first cold-start the model via trajectory-splitting SFT, then scale it via progressive RL training. Specifically, we first activate the basic agentic abilities of a base model with a comprehensive SFT recipe. We then introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics. Using this pipeline, we build thousands of long-horizon trajectories distilled from Claude 4.5 Sonnet (Thinking). To train on these extremely long trajectories, we propose a new trajectory-splitting SFT method, which preserves early context, progressively truncates later context, and maintains overlap between sub-trajectories. To further improve long-horizon task-solving capability, we propose a novel progressive RL strategy, which schedules training into multiple stages with progressively extended timeouts. Experiments demonstrate the superiority and generalization of KLong, as shown in Figure 1. Notably, our proposed KLong (106B) surpasses Kimi K2 Thinking (1T) by 11.28% on PaperBench, and the performance improvement generalizes to other coding benchmarks such as SWE-bench Verified and MLE-bench.
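The trajectory-splitting idea (keep early context, progressively truncate later context, overlap consecutive sub-trajectories) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation; the function name and the `prefix_len`, `window`, and `overlap` parameters are all assumptions made here for clarity.

```python
def split_trajectory(steps, prefix_len=2, window=4, overlap=1):
    """Split one long trajectory into overlapping training samples.

    Hypothetical sketch of trajectory-splitting SFT: every sample keeps
    the first `prefix_len` steps (the preserved early context) plus a
    sliding window of `window` later steps; consecutive windows share
    `overlap` steps, so later context is progressively truncated rather
    than carried in full.
    """
    prefix = steps[:prefix_len]          # early context kept in every sample
    rest = steps[prefix_len:]            # later context, covered by windows
    stride = window - overlap            # how far each window advances
    samples = []
    for start in range(0, max(len(rest), 1), stride):
        chunk = rest[start:start + window]
        if not chunk:
            break
        samples.append(prefix + chunk)   # each sample = early context + window
        if start + window >= len(rest):  # final window reached the end
            break
    return samples
```

For example, a 10-step trajectory with `prefix_len=2`, `window=4`, `overlap=1` yields three samples, each beginning with the same two early-context steps and sharing one step with its neighbor.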