Reasoning LLMs are Wandering Solution Explorers

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current reasoning-oriented large language models (RLLMs) behave as “wanderers” rather than “systematic explorers”: they lack structured, verifiable exploration of the solution space, which produces spurious steps, redundant search, hallucinated conclusions, and sharp performance degradation on complex tasks. Method: the paper formally defines systematic problem-solving capability for RLLMs and proposes an evaluation paradigm centered on the structural integrity of the reasoning process, introducing process-aware metrics and tools that combine qualitative case analysis with quantitative process tracing across mainstream reasoning techniques, including chain-of-thought prompting and tree search. Contribution/Results: multi-model, cross-task diagnostics over several state-of-the-art RLLMs validate the framework’s effectiveness in identifying systematic reasoning deficiencies, offering a reproducible methodology for diagnosing reasoning brittleness and guiding robustness improvements.

📝 Abstract
Large Language Models (LLMs) have demonstrated impressive reasoning abilities through test-time computation (TTC) techniques such as chain-of-thought prompting and tree-based reasoning. However, we argue that current reasoning LLMs (RLLMs) lack the ability to systematically explore the solution space. This paper formalizes what constitutes systematic problem solving and identifies common failure modes that reveal reasoning LLMs to be wanderers rather than systematic explorers. Through qualitative and quantitative analysis across multiple state-of-the-art LLMs, we uncover persistent issues: invalid reasoning steps, redundant explorations, and hallucinated or unfaithful conclusions, among others. Our findings suggest that current models can appear competent on simple tasks yet degrade sharply as complexity increases. Based on these findings, we advocate for new metrics and tools that evaluate not just final outputs but the structure of the reasoning process itself.
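
To make the advocated process-level evaluation concrete, here is a minimal sketch of two such metrics computed over a reasoning trace. It assumes each step has already been labeled with the solution-space state it reaches and whether it is logically valid; the `Step` record and the labels are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Step:
    state: str   # canonical encoding of the partial solution this step reaches
    valid: bool  # whether the step follows logically from its predecessors

def redundancy_ratio(trace: list[Step]) -> float:
    """Fraction of steps that revisit an already-explored state."""
    seen: set[str] = set()
    revisits = 0
    for step in trace:
        if step.state in seen:
            revisits += 1
        seen.add(step.state)
    return revisits / len(trace) if trace else 0.0

def invalid_step_rate(trace: list[Step]) -> float:
    """Fraction of steps flagged as logically invalid."""
    return sum(not s.valid for s in trace) / len(trace) if trace else 0.0

# A short trace that wanders back to state "A" and takes one invalid step.
trace = [Step("A", True), Step("B", True), Step("A", True), Step("C", False)]
print(redundancy_ratio(trace))   # 0.25
print(invalid_step_rate(trace))  # 0.25
```

A systematic explorer would keep both numbers near zero as task complexity grows; the paper's observation is that current RLLMs do not.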
Problem

Research questions and friction points this paper is trying to address.

LLMs lack systematic solution space exploration
Current models exhibit invalid reasoning and hallucinations
Performance degrades sharply with increasing complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Formalizes systematic problem solving criteria
Identifies common failure modes in reasoning
Advocates metrics for reasoning process evaluation (see the sketch below)
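
As a toy illustration of the wanderer-versus-explorer contrast, the sketch below compares coverage of a small finite solution space under a fixed step budget: a systematic explorer never revisits a state, while a memoryless wanderer samples uniformly. Both routines are hypothetical illustrations, not code from the paper.

```python
import random

def systematic_coverage(space: list[str], budget: int) -> float:
    """Enumerate states in order without revisits: coverage grows linearly."""
    return len(set(space[:budget])) / len(space)

def wandering_coverage(space: list[str], budget: int, seed: int = 0) -> float:
    """Sample uniformly with no memory: revisited states waste the budget."""
    rng = random.Random(seed)
    visited = {rng.choice(space) for _ in range(budget)}
    return len(visited) / len(space)

space = [f"s{i}" for i in range(100)]
print(systematic_coverage(space, 50))  # 0.5, guaranteed
print(wandering_coverage(space, 50))   # ~0.4 in expectation, due to repeats
```

The gap widens as the space grows, mirroring the paper's finding that apparent competence on simple tasks degrades sharply with complexity.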
Jiahao Lu
NUS AI Institute, National University of Singapore
Ziwei Xu
National University of Singapore
Machine Learning · Knowledge Representation · AI Safety
Mohan Kankanhalli
NUS AI Institute, National University of Singapore