LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning

📅 2026-04-15
📈 Citations: 0
Influential: 0

🤖 AI Summary
Current language models struggle with complex reasoning tasks that involve long-horizon, multi-step dependencies. This work introduces LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic, built to isolate and directly measure long-horizon chain-of-thought reasoning. Each problem pairs a short input with a verifiable answer and requires navigating a graph of interdependent steps. At release, the best models score below 10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%), revealing a substantial gap in their capacity for extended chain-of-thought reasoning.

📝 Abstract
As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Problems consist of a short input with a verifiable answer; solving them requires navigating a graph of interdependent steps that span tens to hundreds of thousands of reasoning tokens. Each local step is individually tractable for frontier models, so failures reflect long-horizon reasoning limitations. At release, the best models achieve <10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%) on LongCoT, revealing a substantial gap in current capabilities. Overall, LongCoT provides a rigorous measure of long-horizon reasoning, tracking the ability of frontier models to reason reliably over extended periods.
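
The structure described in the abstract (a short input, a graph of interdependent steps, and a verifiable final answer) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the step names, the arithmetic, and the `solve` and `verify` helpers are hypothetical and do not reflect the benchmark's actual data format or grader.

```python
from graphlib import TopologicalSorter

# Hypothetical toy problem: each node is one easy local step, and its
# inputs are the results of the nodes it depends on.
STEPS = {
    "a": ((), lambda: 3),
    "b": ((), lambda: 4),
    "c": (("a", "b"), lambda a, b: a * b),
    "d": (("c", "a"), lambda c, a: c + a),
    "answer": (("d", "b"), lambda d, b: d % b),
}


def solve(steps):
    """Evaluate every step in dependency order and return all results."""
    # graphlib expects a mapping from each node to its predecessors.
    graph = {name: set(deps) for name, (deps, _) in steps.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        deps, fn = steps[name]
        results[name] = fn(*(results[d] for d in deps))
    return results


def verify(predicted, steps):
    """Exact-match check of a predicted final answer against the reference."""
    return predicted == solve(steps)["answer"]


print(verify(3, STEPS))
```

Each step here is trivial in isolation, but the final answer is correct only if every intermediate result is carried forward correctly, which is the property the abstract attributes to its much longer problems.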
Problem

Research questions and friction points this paper is trying to address.

long-horizon reasoning
chain-of-thought
language models
reasoning benchmark
complex reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Long-horizon reasoning
Chain-of-Thought (CoT)
reasoning benchmark
scalable evaluation
interdependent reasoning steps
Sumeet Ramesh Motwani
University of Oxford
Daniel Nichols
Doctoral Student, University of Maryland, College Park
computer science, high performance computing, deep learning
Charles London
DPhil Student in CS, University of Oxford
machine learning, learning theory, deep learning, statistics
Peggy Li
Lawrence Livermore National Laboratory (LLNL)
Fabio Pizzati
MBZUAI
computer vision, deep learning, generative models
Acer Blake
University of Oxford
Hasan Hammoud
KAUST
Tavish McDonald
Lawrence Livermore National Laboratory (LLNL)
Akshat Naik
Graduate Student, University of Oxford
AI safety, alignment, evaluations
Alesia Ivanova
University of Oxford
Vignesh Baskaran
Hexo AI
Ivan Laptev
Professor at MBZUAI, on leave from INRIA
Computer Vision, Robotics, Action Recognition, Object Recognition
Ruben Glatt
Lawrence Livermore National Laboratory (LLNL)
Tal Ben-Nun
Lawrence Livermore National Laboratory
High Performance Computing, Parallel and Distributed Algorithms, Programming Models, Machine Learning
Philip Torr
Professor, University of Oxford
Department of Engineering
Natasha Jaques
University of Washington, Google Research
Social reinforcement learning, Machine learning, deep learning, multi-agent, human-AI interaction
Ameya Prabhu
Tübingen AI Center, University of Tübingen
Data-Centric ML, Science of Benchmarking, Continual Learning, Economics of Transformative AI
Brian Bartoldson
Lawrence Livermore National Laboratory
machine learning, artificial intelligence
Bhavya Kailkhura
Research Scientist, Lawrence Livermore National Laboratory
AI Security & Alignment, Compressed & Fast AI
Christian Schroeder de Witt
University of Oxford
Multi-agent Learning, Security, Safety