Eliciting Reasoning in Language Models with Cognitive Tools

📅 2025-06-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Large language models (LLMs) have opaque reasoning mechanisms and typically rely on lengthy chain-of-thought (CoT) prompting or computationally expensive reinforcement learning.
Method: We propose a lightweight, cognitive-psychology-inspired reasoning paradigm that materializes modular "cognitive tools" (such as decomposition and verification) as callable functions embedded directly in the LLM's inference workflow. The approach requires no fine-tuning, RLHF, or CoT distillation; explicit, stepwise reasoning emerges from zero-shot prompting and autonomous tool invocation by the LLM itself.
Contribution/Results: This work is the first to concretize classical cognitive-architecture principles into a scalable, interpretable LLM reasoning interface. With cognitive tools, GPT-4.1 achieves 43.3% pass@1 on AIME2024 (+16.6 percentage points), approaching o1-preview performance; open-source models, including Qwen2.5-Math, also show substantial gains, confirming the method's generality and effectiveness.
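The core idea, "cognitive tools" as callable functions executed by the LLM itself, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's actual implementation: the tool names, prompt wording, and the pluggable `llm` callable are all placeholders introduced for this sketch.

```python
# Minimal sketch: each "cognitive tool" wraps a reasoning-specific prompt
# and delegates execution back to the same LLM. Tool names and prompts
# below are illustrative assumptions, not the paper's exact prompts.
from typing import Callable

LLM = Callable[[str], str]  # any function mapping a prompt to a completion

def understand_question(llm: LLM, question: str) -> str:
    """Decomposition tool: restate the problem and identify subgoals."""
    return llm("Break the following problem into its key components and "
               f"subgoals, without solving it yet:\n{question}")

def examine_answer(llm: LLM, question: str, draft: str) -> str:
    """Verification tool: check a candidate solution step by step."""
    return llm("Check this candidate solution step by step for errors.\n"
               f"Problem: {question}\nCandidate: {draft}")

COGNITIVE_TOOLS: dict[str, Callable] = {
    "understand_question": understand_question,
    "examine_answer": examine_answer,
}

def solve(llm: LLM, question: str) -> str:
    """Orchestrate the tools sequentially, then produce a final answer."""
    analysis = COGNITIVE_TOOLS["understand_question"](llm, question)
    draft = llm(f"Using this analysis:\n{analysis}\nSolve:\n{question}")
    check = COGNITIVE_TOOLS["examine_answer"](llm, question, draft)
    return llm(f"Given this check:\n{check}\nFinal answer to:\n{question}")
```

Because every tool is just a prompt executed by the base model, the setup needs no fine-tuning; swapping in a real model only requires replacing the `llm` callable.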

📝 Abstract
The recent advent of reasoning models like OpenAI's o1 was met with excited speculation by the AI community about the mechanisms underlying these capabilities in closed models, followed by a rush of replication efforts, particularly from the open source community. These speculations were largely settled by the demonstration from DeepSeek-R1 that chains-of-thought and reinforcement learning (RL) can effectively replicate reasoning on top of base LLMs. However, it remains valuable to explore alternative methods for theoretically eliciting reasoning that could help elucidate the underlying mechanisms, as well as providing additional methods that may offer complementary benefits. Here, we build on the long-standing literature in cognitive psychology and cognitive architectures, which postulates that reasoning arises from the orchestrated, sequential execution of a set of modular, predetermined cognitive operations. Crucially, we implement this key idea within a modern agentic tool-calling framework. In particular, we endow an LLM with a small set of "cognitive tools" encapsulating specific reasoning operations, each executed by the LLM itself. Surprisingly, this simple strategy results in considerable gains in performance on standard mathematical reasoning benchmarks compared to base LLMs, for both closed and open-weight models. For instance, providing our "cognitive tools" to GPT-4.1 increases its pass@1 performance on AIME2024 from 26.7% to 43.3%, bringing it very close to the performance of o1-preview. In addition to its practical implications, this demonstration contributes to the debate regarding the role of post-training methods in eliciting reasoning in LLMs versus the role of inherent capabilities acquired during pre-training, and whether post-training merely uncovers these latent abilities.
Problem

Research questions and friction points this paper is trying to address.

Exploring alternative methods for eliciting reasoning in LLMs
Enhancing reasoning performance with lightweight, training-free cognitive tools
Clarifying the roles of post-training versus pre-training in reasoning capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Encapsulates modular reasoning operations as "cognitive tools"
Implements the tools within a modern agentic tool-calling framework
Improves mathematical reasoning benchmarks without fine-tuning or RL
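The agentic tool-calling framing above can be illustrated with a minimal dispatch loop in which the model autonomously decides which cognitive tool to invoke before committing to an answer. The `TOOL:`/`ANSWER:` text protocol and the step limit here are assumptions of this sketch, not the paper's interface.

```python
# Minimal sketch of autonomous tool invocation: the model either emits
# "TOOL: <name>" to call a cognitive tool, or "ANSWER: <text>" to stop.
# The protocol, tool set, and max_steps cap are illustrative assumptions.
from typing import Callable

def agentic_loop(llm: Callable[[str], str],
                 tools: dict[str, Callable[[str], str]],
                 question: str, max_steps: int = 5) -> str:
    transcript = f"Problem: {question}\nAvailable tools: {list(tools)}"
    for _ in range(max_steps):
        reply = llm(transcript)
        if reply.startswith("ANSWER:"):
            return reply[len("ANSWER:"):].strip()
        if reply.startswith("TOOL:"):
            name = reply[len("TOOL:"):].strip()
            if name in tools:
                # Execute the chosen tool and append its output as context.
                transcript += f"\n[{name}] {tools[name](transcript)}"
                continue
        transcript += f"\n{reply}"  # treat as a plain reasoning step
    return llm(transcript + "\nGive your final answer now.")
```

Because tool outputs are fed back into the transcript, each invocation conditions the model's next decision, which is what distinguishes this agentic setup from a fixed prompting pipeline.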