Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models exhibit significant performance degradation on long-horizon reasoning tasks, and existing approaches rely on inference-time scaffolding or dense step-level supervision, limiting scalability. This paper proposes a curriculum-based reinforcement learning framework that requires no step-wise annotations, only an outcome-level reward, and leverages abundant short-horizon data to synthesize multi-step reasoning chains, enabling scalable training of long-horizon reasoning capabilities. Key contributions include: (1) a problem-synthesis mechanism that composes simple problems into multi-step dependency chains; (2) a theoretical proof that the curriculum RL strategy achieves better sample complexity than full-horizon training; and (3) empirical evidence of a substantially enhanced capacity for discovering novel reasoning paths. After training on GSM8K, the model achieves up to a 2.06× accuracy improvement on long-chain mathematical benchmarks, including MATH-500 and AIME, and consistently outperforms baselines at high pass@k.
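The summary's problem-synthesis idea can be illustrated with a minimal sketch. Everything here is invented for illustration (the atomic problems, `compose_chain`, and `outcome_reward` are hypothetical names, not the paper's implementation; the paper composes real GSM8K problems): each atomic step maps a number to a new number, chaining steps makes step i depend on the answer to step i-1, and the reward checks only the final answer.

```python
import random

# Toy "atomic" short-horizon problems: a text template plus its ground-truth
# transformation of the running value.
ATOMIC_PROBLEMS = [
    ("Add 7 to {x}.", lambda x: x + 7),
    ("Double {x}.", lambda x: x * 2),
    ("Subtract 3 from {x}.", lambda x: x - 3),
]

def compose_chain(length, seed=0, start=5):
    """Compose `length` atomic problems into one multi-step dependency chain.
    Returns the prompt text and the ground-truth final answer, obtained by
    applying each sampled step to the previous step's answer."""
    rng = random.Random(seed)  # seeded so synthesis is reproducible
    steps, value = [], start
    for i in range(length):
        text, fn = rng.choice(ATOMIC_PROBLEMS)
        ref = "the starting number" if i == 0 else f"the answer to step {i}"
        steps.append(f"Step {i + 1}: {text.format(x=ref)}")
        value = fn(value)
    prompt = f"The starting number is {start}.\n" + "\n".join(steps)
    return prompt, value

def outcome_reward(model_answer, target):
    """Outcome-only reward: 1 if the final answer matches, else 0.
    No step-level supervision is used anywhere."""
    return 1.0 if model_answer == target else 0.0
```

Because the chain is generated programmatically, the ground-truth answer comes for free, which is what makes outcome-only rewards sufficient at arbitrary horizon lengths.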

📝 Abstract
Large language models excel at short-horizon reasoning tasks, but performance drops as reasoning horizon lengths increase. Existing approaches to combat this rely on inference-time scaffolding or costly step-level supervision, neither of which scales easily. In this work, we introduce a scalable method to bootstrap long-horizon reasoning capabilities using only existing, abundant short-horizon data. Our approach synthetically composes simple problems into complex, multi-step dependency chains of arbitrary length. We train models on this data using outcome-only rewards under a curriculum that automatically increases in complexity, allowing RL training to be scaled much further without saturating. Empirically, our method generalizes remarkably well: curriculum training on composed 6th-grade level math problems (GSM8K) boosts accuracy on longer, competition-level benchmarks (GSM-Symbolic, MATH-500, AIME) by up to 2.06x. Importantly, our long-horizon improvements are significantly higher than baselines even at high pass@k, showing that models can learn new reasoning paths under RL. Theoretically, we show that curriculum RL with outcome rewards achieves an exponential improvement in sample complexity over full-horizon training, providing training signal comparable to dense supervision. Our method therefore introduces an efficient path towards scaling RL for long-horizon problems using only existing data.
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLMs' long-horizon reasoning via reinforcement learning
Scaling reasoning capabilities using only short-horizon training data
Improving performance on complex multi-step dependency chains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic composition of simple problems into complex chains
Curriculum RL training with outcome-only rewards
Exponential sample complexity improvement over full-horizon training
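The second innovation, a curriculum that automatically increases in complexity, can be sketched as follows. This is a hedged stand-in, not the paper's training loop: `curriculum_schedule`, `evaluate`, and the 0.8 threshold are invented for illustration, and a real run would perform RL updates on an LLM at each length instead of merely measuring a pass rate.

```python
def curriculum_schedule(evaluate, start_len=1, max_len=8, threshold=0.8):
    """Advance the reasoning horizon only when the model earns enough
    outcome-only reward at the current chain length.

    `evaluate(length)` returns the measured success rate (pass rate) on
    composed chains of that length. Returns the (length, rate) history."""
    history = []
    length = start_len
    while length <= max_len:
        rate = evaluate(length)
        history.append((length, rate))
        if rate >= threshold:
            length += 1  # graduate to longer dependency chains
        else:
            break  # in a real loop: keep doing RL updates at this length
    return history

# Toy evaluator: success decays with horizon length, so the curriculum
# advances through short chains and stalls where more training is needed.
demo = curriculum_schedule(lambda L: (10 - L) / 10)
```

The schedule concentrates training signal at the frontier of what the model can solve, which is the intuition behind the claimed sample-complexity advantage over training directly at the full horizon.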
👥 Authors
S. Motwani, University of Oxford
Alesia Ivanova, University of Oxford
Ziyang Cai, Princeton University
Philip Torr, Professor, University of Oxford, Department of Engineering
Riashat Islam, Microsoft Research NYC (Deep Reinforcement Learning, Deep Learning, Generative Models)
Shital Shah, Microsoft AI Frontiers
Christian Schröder de Witt, University of Oxford
Charles London, DPhil Student in CS, University of Oxford (machine learning, learning theory, deep learning, statistics)