🤖 AI Summary
This work addresses the limited long-horizon planning and multi-step reasoning capabilities of large language models (LLMs) in complex knowledge graphs by introducing the first quantifiable benchmark based on Wikipedia hyperlink navigation. It formalizes the "WikiRacing" task, in which a model must plan a path of hyperlinks from a source page to a target page. Built on the real-world Wikipedia knowledge graph, the benchmark comprises navigation tasks of varying difficulty and systematically evaluates state-of-the-art models, including Gemini-3, GPT-5, and Claude Opus 4.5. Experimental results show that while top models surpass human performance on simple tasks, their success rate drops sharply to 23% on challenging ones, revealing fundamental limitations in dynamic replanning and loop avoidance and underscoring long-horizon planning as a critical bottleneck in current LLMs.
📝 Abstract
We introduce LLM-Wikirace, a benchmark for evaluating planning, reasoning, and world knowledge in large language models (LLMs). In LLM-Wikirace, models must efficiently navigate Wikipedia hyperlinks step by step to reach a target page from a given source, requiring look-ahead planning and the ability to reason about how concepts are connected in the real world. We evaluate a broad set of open- and closed-source models; Gemini-3, GPT-5, and Claude Opus 4.5 achieve the strongest results on the easy level of the task and demonstrate superhuman performance. Despite this, performance drops sharply on hard difficulty: the best-performing model, Gemini-3, succeeds in only 23% of hard games, highlighting substantial remaining challenges for frontier models. Our analysis shows that world knowledge is a necessary ingredient for success, but only up to a point; beyond this threshold, planning and long-horizon reasoning capabilities become the dominant factors. Trajectory-level analysis further reveals that even the strongest models struggle to replan after failure, frequently entering loops rather than recovering. LLM-Wikirace is a simple benchmark that reveals clear limitations in current reasoning systems, offering an open arena where planning-capable LLMs still have much to prove. Our code and leaderboard are available at https://llmwikirace.github.io.
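The underlying task is a search over a hyperlink graph: given a source and a target page, find a chain of links connecting them. A minimal sketch of the idea on a toy, hypothetical graph (the page names and graph below are illustrative assumptions, not data from the benchmark; the actual benchmark uses the real Wikipedia link graph, and the models navigate greedily rather than with exhaustive search):

```python
from collections import deque

# Toy hyperlink graph (hypothetical pages, for illustration only):
# each key is a page, values are the pages it links to.
GRAPH = {
    "Coffee": ["Caffeine", "Ethiopia"],
    "Caffeine": ["Chemistry", "Sleep"],
    "Ethiopia": ["Africa"],
    "Chemistry": ["Science"],
    "Sleep": ["Brain"],
    "Africa": ["Earth"],
    "Science": ["Earth"],
    "Brain": [],
    "Earth": [],
}

def shortest_path(source, target):
    """Breadth-first search for a shortest hyperlink path, as a
    ground-truth oracle against which an agent's path can be judged."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        page = path[-1]
        if page == target:
            return path
        for nxt in GRAPH.get(page, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # target unreachable from source

print(shortest_path("Coffee", "Earth"))
# → ['Coffee', 'Ethiopia', 'Africa', 'Earth']
```

An LLM player, by contrast, sees only the outgoing links of its current page at each step and must pick one, which is why look-ahead planning and recovery from dead ends dominate performance.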