🤖 AI Summary
Current large language models (LLMs) rely on implicit, unstructured reasoning, resulting in unstable inference paths, limited error correction, and poor reuse of prior experience. To address these limitations, we propose a structured reasoning framework comprising three core components: (1) trajectory analysis to extract successful reasoning paths and distill transferable, structured reasoning guidelines; (2) automated reflection signal extraction from failure cases to enable dynamic step-wise refinement; and (3) a self-consistency verification mechanism coupled with an error-feedback loop for unsupervised, iterative optimization, without fine-tuning. Our framework supports cross-task guideline sharing and multi-model collaboration. Empirically, it achieves significant improvements over strong baselines on BBH, GSM8K, MATH-500, MBPP, and HumanEval, while also enhancing inference stability and cross-domain generalization.
📝 Abstract
Large language models (LLMs) have advanced general-purpose reasoning, showing strong performance across diverse tasks. However, existing methods often rely on implicit exploration, where the model follows stochastic and unguided reasoning paths, like walking without a map. This leads to unstable reasoning paths, a lack of error correction, and limited learning from past experience. To address these issues, we propose a framework that shifts from implicit exploration to structured reasoning through guidelines and refinement. First, we extract structured reasoning patterns from successful trajectories and reflective signals from failures. During inference, the model follows these guidelines step by step, with refinement applied after each step to correct errors and stabilize the reasoning process. Experiments on BBH and four additional benchmarks (GSM8K, MATH-500, MBPP, HumanEval) show that our method consistently outperforms strong baselines across diverse reasoning tasks. Structured reasoning with stepwise execution and refinement improves stability and generalization, while guidelines transfer well across domains and flexibly support cross-model collaboration, matching or surpassing supervised fine-tuning in effectiveness and scalability.
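The guideline-plus-refinement loop described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `step_fn` and `refine_fn` are hypothetical stand-ins for the LLM calls that execute one guided step and apply step-wise refinement, and the toy arithmetic task exists only to make the sketch runnable.

```python
from collections import Counter

def reason_with_guidelines(problem, guidelines, step_fn, refine_fn):
    """Follow structured guidelines step by step, refining after each step.

    step_fn(problem, guideline, state) -> new partial state (one guided step)
    refine_fn(problem, state)          -> possibly corrected state
    Both are placeholders for model calls in the actual framework.
    """
    state = []
    for guideline in guidelines:
        state = step_fn(problem, guideline, state)  # execute one guided step
        state = refine_fn(problem, state)           # step-wise refinement
    return state[-1] if state else None

def self_consistent_answer(problem, guidelines, step_fn, refine_fn, n_samples=5):
    """Self-consistency check: sample several trajectories, majority-vote the answer."""
    answers = [reason_with_guidelines(problem, guidelines, step_fn, refine_fn)
               for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-ins: compute a + b*c by following two guidelines in order.
def step_fn(problem, guideline, state):
    a, b, c = problem
    if guideline == "multiply first":
        return state + [b * c]
    if guideline == "then add":
        return state + [a + state[-1]]
    return state

def refine_fn(problem, state):
    return state  # no-op refinement in this deterministic toy example

print(self_consistent_answer((2, 3, 4), ["multiply first", "then add"],
                             step_fn, refine_fn))  # -> 14
```

In the real framework the sampled trajectories are stochastic, so the majority vote and the error-feedback loop do meaningful work; here the toy functions are deterministic purely to keep the sketch self-contained.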