🤖 AI Summary
Current language models struggle with complex reasoning tasks that involve long-horizon, multi-step dependencies. This work introduces LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic, built to isolate and directly measure long-horizon chain-of-thought (CoT) reasoning. Each problem pairs a short input with a verifiable answer, and solving it requires navigating a graph of interdependent steps that spans tens to hundreds of thousands of reasoning tokens. Because each local step is individually tractable for frontier models, failures reflect long-horizon reasoning limitations. At release, the best models score below 10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%), revealing a substantial gap in their capacity for extended CoT reasoning.
📝 Abstract
As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Problems consist of a short input with a verifiable answer; solving them requires navigating a graph of interdependent steps that span tens to hundreds of thousands of reasoning tokens. Each local step is individually tractable for frontier models, so failures reflect long-horizon reasoning limitations. At release, the best models achieve <10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%) on LongCoT, revealing a substantial gap in current capabilities. Overall, LongCoT provides a rigorous measure of long-horizon reasoning, tracking the ability of frontier models to reason reliably over extended periods.
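To make the problem structure concrete, the sketch below shows a toy problem whose answer depends on a small graph of interdependent steps, together with a back-of-the-envelope estimate of how per-step errors could compound over a long chain. It is an illustration only and is not taken from the LongCoT benchmark: the step graph `STEPS`, the helpers `solve` and `chain_success`, and the assumption that step errors are independent are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy problem: each step lists the earlier steps it depends on
# and a small function that is easy to evaluate on its own.
STEPS = {
    "a": ((), lambda: 3),
    "b": ((), lambda: 4),
    "c": (("a", "b"), lambda a, b: a * a + b * b),
    "d": (("c",), lambda c: c - 5),
    "e": (("a", "d"), lambda a, d: a + d),  # final, verifiable answer
}


def solve(steps):
    """Evaluate every step in dependency order and return all step values."""
    graph = {name: deps for name, (deps, _) in steps.items()}
    values = {}
    for name in TopologicalSorter(graph).static_order():
        deps, fn = steps[name]
        values[name] = fn(*(values[d] for d in deps))
    return values


def chain_success(p_step, n_steps):
    """Chance that all n_steps succeed, ASSUMING independent per-step errors."""
    return p_step ** n_steps


if __name__ == "__main__":
    print("final answer:", solve(STEPS)["e"])
    for n in (10, 100, 1000):
        print(n, "steps at 99.9% per step:", round(chain_success(0.999, n), 3))
```

Under the independence assumption, even a very high per-step success rate decays quickly as the chain grows, which is one simple way to see how locally tractable steps can still add up to a hard problem. Real reasoning chains need not satisfy that assumption, so the numbers are indicative at best.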