PHANTOM RECALL: When Familiar Puzzles Fool Smart Models

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether large language models (LLMs) possess genuine logical reasoning capabilities or merely rely on memorized templates—termed “phantom recall.” To this end, we introduce PHANTOM RECALL, a benchmark comprising 25 classic logic puzzles and 149 structurally preserved yet semantically perturbed variants. We propose a fine-grained error taxonomy, an automated logical equivalence checker, and a targeted prompt engineering framework. Experiments reveal that while mainstream LLMs achieve high accuracy on original puzzles, their performance degrades substantially on variants, exhibiting widespread phantom recall and over-explanation artifacts. This study provides the first systematic evidence of the intrinsic fragility of LLMs’ logical reasoning. Moreover, we propose a transferable prompt reformulation method that enhances robustness, establishing a foundation for rigorous evaluation and improvement of model reasoning reliability.

📝 Abstract
Large language models (LLMs) such as GPT, Gemini, and Claude often appear adept at solving classic logic puzzles, but how much genuine reasoning underlies their answers? Recent evidence suggests that these models frequently rely on memorized templates rather than reasoning from first principles: when puzzles are slightly modified, their performance collapses, revealing a striking fragility. This raises several questions: Have LLMs addressed these issues, and to what extent? Do perturbations to other puzzles trigger the same failures? Is there a general way of reformulating prompts so that models perform better? To examine these questions systematically, we introduce PHANTOM RECALL, a benchmark comprising 25 well-known logic puzzles and 149 carefully designed perturbations that preserve reasoning structure but alter superficial details and solutions. We evaluate eleven leading LLMs and identify a recurring failure mode, phantom recall, in which models confidently reproduce memorized solutions or spurious rationales that no longer fit the altered scenario. To probe and mitigate this issue, we contribute three tools: (i) an automated logical-equivalence judge to detect reasoning mismatches, (ii) a taxonomy of fine-grained reasoning error categories, and (iii) a prompting-based mitigation framework guided by these categories. Despite near-perfect accuracy on unmodified puzzles, models significantly underperform humans on perturbed ones, exhibiting both phantom recall and over-elaboration. Our findings reveal a crucial limitation: LLMs often fail to re-reason when contextual cues shift, highlighting the gap between linguistic fluency and logical understanding.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' genuine reasoning versus memorization in logic puzzles
Assessing model performance collapse under puzzle perturbations
Developing tools to detect and mitigate phantom recall errors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated logical-equivalence judge detects reasoning mismatches
Taxonomy categorizes fine-grained reasoning error types
Prompting-based mitigation framework guided by error categories
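The paper does not publish its judge's implementation here, and it is most likely model-assisted; as a toy illustration of what "logical equivalence" means in this context, the sketch below (with hypothetical names `equivalent`, `memorized`, `required`) compares two propositional answers by brute-force truth-table enumeration, flagging a mismatch when a recalled solution no longer matches what the perturbed puzzle requires:

```python
from itertools import product

def equivalent(f, g, variables):
    """Check propositional equivalence by enumerating all truth assignments."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if bool(f(**env)) != bool(g(**env)):
            return False  # found an assignment where the answers diverge
    return True

# Equivalent phrasings of the same conclusion pass the check.
memorized = lambda A, B: A and B
rephrased = lambda A, B: B and A
print(equivalent(memorized, rephrased, ["A", "B"]))  # True

# A phantom-recall mismatch: the model reproduces "A or B"
# although the perturbed variant actually requires "A and B".
recalled = lambda A, B: A or B
required = lambda A, B: A and B
print(equivalent(recalled, required, ["A", "B"]))  # False
```

Truth-table enumeration is exponential in the number of variables, so it only works for small formulas; the point is the interface, not the scale: a judge that tests semantic equivalence rather than string overlap can detect when a confidently stated answer silently fails the altered scenario.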
Souradeep Mukhopadhyay
School of Computing and Augmented Intelligence, Arizona State University
Rishabh Baral
School of Computing and Augmented Intelligence, Arizona State University
Nimeesh Mahajan
School of Computing and Augmented Intelligence, Arizona State University
Samhitha Harish
School of Computing and Augmented Intelligence, Arizona State University
Aswin RRV
PhD CS, Arizona State University
Reinforcement Learning, Self-Supervised Learning, Reasoning in Language Models
Mihir Parmar
School of Computing and Augmented Intelligence, Arizona State University
Mutsumi Nakamura
School of Computing and Augmented Intelligence, Arizona State University
Chitta Baral
Professor of Computer Science, Arizona State University
Knowledge Representation, NLP, Vision, Robotics, Integrated Systems