🤖 AI Summary
This work addresses the limited long-horizon planning and multi-step reasoning capabilities of large language models (LLMs) in complex knowledge graphs by introducing the first quantifiable benchmark based on Wikipedia hyperlink navigation. It formalizes the "WikiRacing" task, in which a model must plan a path of hyperlinks from a source page to a target page. Built on the real-world Wikipedia knowledge graph, the benchmark comprises navigation tasks of varying difficulty and systematically evaluates state-of-the-art models, including Gemini-3, GPT-5, and Claude Opus 4.5. Experimental results show that while top models surpass human performance on simple tasks, their success rate drops sharply to 23% on challenging ones, revealing fundamental limitations in dynamic replanning and loop avoidance and underscoring long-horizon planning as a critical bottleneck in current LLMs.
📝 Abstract
We introduce LLM-Wikirace, a benchmark for evaluating planning, reasoning, and world knowledge in large language models (LLMs). In LLM-Wikirace, models must efficiently navigate Wikipedia hyperlinks step by step to reach a target page from a given source, requiring look-ahead planning and the ability to reason about how concepts are connected in the real world. We evaluate a broad set of open- and closed-source models; Gemini-3, GPT-5, and Claude Opus 4.5 achieve the strongest results on the easy level of the task and demonstrate superhuman performance. Despite this, performance drops sharply on hard difficulty: the best-performing model, Gemini-3, succeeds in only 23% of hard games, highlighting substantial remaining challenges for frontier models. Our analysis shows that world knowledge is a necessary ingredient for success, but only up to a point; beyond this threshold, planning and long-horizon reasoning capabilities become the dominant factors. Trajectory-level analysis further reveals that even the strongest models struggle to replan after failure, frequently entering loops rather than recovering. LLM-Wikirace is a simple benchmark that reveals clear limitations in current reasoning systems, offering an open arena where planning-capable LLMs still have much to prove. Our code and leaderboard are available at https://llmwikirace.github.io.
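The underlying task is a search over a hyperlink graph: given a source and a target page, find a chain of links connecting them. A minimal sketch of the idea on a toy, hypothetical graph (the page names and graph below are illustrative assumptions, not data from the benchmark; the actual benchmark uses the real Wikipedia link graph, and the models navigate greedily rather than with exhaustive search):

```python
from collections import deque

# Toy hyperlink graph (hypothetical pages, for illustration only):
# each key is a page, values are the pages it links to.
GRAPH = {
    "Coffee": ["Caffeine", "Ethiopia"],
    "Caffeine": ["Chemistry", "Sleep"],
    "Ethiopia": ["Africa"],
    "Chemistry": ["Science"],
    "Sleep": ["Brain"],
    "Africa": ["Earth"],
    "Science": ["Earth"],
    "Brain": [],
    "Earth": [],
}

def shortest_path(source, target):
    """Breadth-first search for a shortest hyperlink path, as a
    ground-truth oracle against which an agent's path can be judged."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        page = path[-1]
        if page == target:
            return path
        for nxt in GRAPH.get(page, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # target unreachable from source

print(shortest_path("Coffee", "Earth"))
# → ['Coffee', 'Ethiopia', 'Africa', 'Earth']
```

An LLM player, by contrast, sees only the outgoing links of its current page at each step and must pick one, which is why look-ahead planning and recovery from dead ends dominate performance.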