🤖 AI Summary
Large language models (LLMs) frequently regenerate identical intermediate reasoning steps during multi-step inference, leading to excessive token consumption, increased latency, context saturation, and diminished exploratory capacity. To address this, the authors propose the *Behavior Handbook*: a metacognitive mechanism that identifies recurring reasoning patterns through the model's own analysis of prior chain-of-thought traces, distills them into reusable, structured behavioral units (name + instruction), and supports three application modes: in-context injection, self-improvement, and supervised fine-tuning (SFT). Experiments show up to a 46% reduction in reasoning tokens with maintained or improved accuracy; up to 10% higher accuracy in self-improvement over a critique-and-revise baseline; and more effective SFT for converting non-reasoning models into reasoning models. The core contribution is making implicit reasoning explicit as modular, retrievable, composable, and evolvable behavioral knowledge.
📝 Abstract
Large language models (LLMs) now solve multi-step problems by emitting extended chains of thought. In the process, they often re-derive the same intermediate steps across problems, which inflates token usage and latency, saturates the context window, and leaves less capacity for exploration. We study a simple mechanism that converts recurring reasoning fragments into concise, reusable "behaviors" (name + instruction) via the model's own metacognitive analysis of prior traces. These behaviors are stored in a "behavior handbook," which supplies them to the model in-context at inference time or distills them into parameters via supervised fine-tuning (SFT). This approach improves test-time reasoning in three settings: (1) Behavior-conditioned inference: providing the LLM relevant behaviors in-context during reasoning reduces the number of reasoning tokens by up to 46% while matching or improving baseline accuracy. (2) Behavior-guided self-improvement: without any parameter updates, the model improves its own future reasoning by leveraging behaviors extracted from its past problem-solving attempts, yielding up to 10% higher accuracy than a naive critique-and-revise baseline. (3) Behavior-conditioned SFT: fine-tuning on behavior-conditioned reasoning traces converts non-reasoning models into reasoning models more effectively than vanilla SFT. Together, these results indicate that turning slow derivations into fast procedural hints lets LLMs remember how to reason, not just what to conclude.
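The handbook-and-injection loop described above can be sketched as a small pipeline: store (name, instruction) behaviors, retrieve the ones relevant to a new problem, and prepend them to the prompt. This is a minimal illustrative sketch, not the paper's implementation; the class names, the keyword-overlap retrieval, and the prompt format are all assumptions standing in for whatever extraction and retrieval the actual system uses.

```python
from dataclasses import dataclass


# A "behavior" as described in the abstract: a short name plus an
# instruction distilled from a prior reasoning trace.
@dataclass(frozen=True)
class Behavior:
    name: str
    instruction: str


class BehaviorHandbook:
    """Toy in-memory handbook: stores behaviors and retrieves the ones
    most relevant to a new problem by keyword overlap (a hypothetical
    stand-in for the real retrieval step)."""

    def __init__(self):
        self._behaviors: list[Behavior] = []

    def add(self, behavior: Behavior) -> None:
        self._behaviors.append(behavior)

    def retrieve(self, problem: str, k: int = 2) -> list[Behavior]:
        # Score each behavior by how many words its instruction shares
        # with the problem statement; return the top-k.
        words = set(problem.lower().split())
        scored = sorted(
            self._behaviors,
            key=lambda b: -len(words & set(b.instruction.lower().split())),
        )
        return scored[:k]


def behavior_conditioned_prompt(problem: str, handbook: BehaviorHandbook) -> str:
    """Prepend retrieved behaviors to the problem so the model can reuse
    them in-context instead of re-deriving the same steps."""
    hints = "\n".join(
        f"- {b.name}: {b.instruction}" for b in handbook.retrieve(problem)
    )
    return f"Useful behaviors:\n{hints}\n\nProblem: {problem}"
```

For example, a handbook holding a `gcd_first` behavior ("reduce the fraction by computing the gcd of numerator and denominator") would rank it first for a fraction-arithmetic problem and surface it as an in-context hint, rather than the model re-deriving the reduction procedure from scratch.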