Procedural Knowledge at Scale Improves Reasoning

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that existing test-time scaling methods fail to systematically reuse procedural knowledge (e.g., problem reformulation, method selection, and verification or backtracking) in complex reasoning. The authors propose the Reasoning Memory framework, which decomposes reasoning trajectories into self-contained subquestion-subroutine pairs, forming a retrieval-augmented generation (RAG) datastore that lets a model dynamically retrieve relevant subroutines and reason under them as implicit procedural priors during inference. This design enables scalable extraction and efficient invocation of procedural knowledge, overcoming the limitations of conventional RAG systems that rely solely on documents or full reasoning traces. Evaluated across six benchmarks in mathematics, science, and programming, the method significantly outperforms existing RAG variants and a compute-matched test-time scaling baseline, with gains of up to 19.2%, highlighting the importance of broad procedural knowledge coverage and effective decomposition and retrieval design.
📝 Abstract
Test-time scaling has emerged as an effective way to improve language models on challenging reasoning tasks. However, most existing methods treat each problem in isolation and do not systematically reuse knowledge from prior reasoning trajectories. In particular, they underutilize procedural knowledge: how to reframe a problem, choose an approach, and verify or backtrack when needed. We introduce Reasoning Memory, a retrieval-augmented generation (RAG) framework for reasoning models that explicitly retrieves and reuses procedural knowledge at scale. Starting from existing corpora of step-by-step reasoning trajectories, we decompose each trajectory into self-contained subquestion-subroutine pairs, yielding a datastore of 32 million compact procedural knowledge entries. At inference time, a lightweight in-thought prompt lets the model verbalize the core subquestion, retrieve relevant subroutines within its reasoning trace, and reason under diverse retrieved subroutines as implicit procedural priors. Across six math, science, and coding benchmarks, Reasoning Memory consistently outperforms RAG with document, trajectory, and template knowledge, as well as a compute-matched test-time scaling baseline. With a higher inference budget, it improves over no retrieval by up to 19.2% and over the strongest compute-matched baseline by 7.9% across task types. Ablation studies show that these gains come from two key factors: the broad procedural coverage of the source trajectories and our decomposition and retrieval design, which together enable effective extraction and reuse of procedural knowledge.
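The retrieval step described in the abstract (verbalize a core subquestion, then fetch matching subroutines from the datastore) can be illustrated with a minimal sketch. This is a hypothetical toy implementation, not the paper's system: the datastore entries, the `retrieve_subroutines` helper, and the bag-of-words cosine similarity are all illustrative stand-ins (the actual datastore holds 32 million entries and would use a learned retriever).

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenization; a stand-in for a real encoder."""
    return [t.lower() for t in text.split()]

def cosine(a, b):
    """Cosine similarity between two token lists (bag-of-words)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical subquestion -> subroutine pairs, of the kind the paper
# decomposes from step-by-step reasoning trajectories.
DATASTORE = [
    ("how to find the roots of a quadratic equation",
     "Apply the quadratic formula x = (-b +/- sqrt(b^2 - 4ac)) / (2a)."),
    ("how to verify a candidate solution to an equation",
     "Substitute the candidate back into the original equation and check equality."),
    ("how to reformulate a word problem as algebra",
     "Assign a variable to each unknown and translate each statement into an equation."),
]

def retrieve_subroutines(subquestion, k=2):
    """Return the k subroutines whose stored subquestions best match the query."""
    q = tokenize(subquestion)
    ranked = sorted(DATASTORE,
                    key=lambda entry: cosine(q, tokenize(entry[0])),
                    reverse=True)
    return [subroutine for _, subroutine in ranked[:k]]
```

In the paper's framework, the retrieved subroutines would be injected into the model's reasoning trace as implicit priors rather than returned to a caller; this sketch only shows the lookup itself.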
Problem

Research questions and friction points this paper is trying to address.

procedural knowledge
reasoning
test-time scaling
knowledge reuse
language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

procedural knowledge
retrieval-augmented generation
reasoning memory
test-time scaling
subquestion-subroutine decomposition