🤖 AI Summary
Current large language models (LLMs) rely on coarse-grained self-verification in complex reasoning tasks, which limits both error correction and interpretability. To address this, we propose the Socratic Self-Refine (SSR) framework, which introduces a Socratic introspection mechanism: it decomposes a model's reasoning into verifiable (sub-question, sub-answer) pairs, enabling step-level confidence estimation through controlled re-solving and self-consistency checks, followed by iterative refinement of unreliable steps. SSR achieves fine-grained evaluation and precise correction without requiring access to model internals. Evaluated across five reasoning benchmarks and three LLMs, SSR consistently outperforms existing self-refinement methods, delivering stable gains in reasoning accuracy while improving transparency and controllability.
📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable reasoning abilities, yet existing test-time frameworks often rely on coarse self-verification and self-correction, limiting their effectiveness on complex tasks. In this paper, we propose Socratic Self-Refine (SSR), a novel framework for fine-grained evaluation and precise refinement of LLM reasoning. Our proposed SSR decomposes model responses into verifiable (sub-question, sub-answer) pairs, enabling step-level confidence estimation through controlled re-solving and self-consistency checks. By pinpointing unreliable steps and iteratively refining them, SSR produces more accurate and interpretable reasoning chains. Empirical results across five reasoning benchmarks and three LLMs show that SSR consistently outperforms state-of-the-art iterative self-refinement baselines. Beyond performance gains, SSR provides a principled black-box approach for evaluating and understanding the internal reasoning processes of LLMs. Code is available at https://github.com/SalesforceAIResearch/socratic-self-refine-reasoning.
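The refinement loop described above can be sketched in a few lines. The sketch below is a minimal illustration, not the authors' implementation: it assumes the chain has already been decomposed into (sub-question, sub-answer) pairs, and it stands in a hypothetical `toy_llm` for real sampled model calls. Each step's confidence is the self-consistency agreement between the recorded sub-answer and `k` re-solves; the least-confident step is re-solved until every step clears a threshold.

```python
from collections import Counter

# Toy "LLM": deterministically evaluates small arithmetic sub-questions.
# A real SSR run would sample an actual model here (hypothetical stand-in).
def toy_llm(sub_question):
    expr = sub_question.removeprefix("What is ").rstrip("?")
    return str(eval(expr))  # e.g. "What is 2+3?" -> "5"

def step_confidence(sub_q, sub_a, llm, k=5):
    # Re-solve the sub-question k times and measure agreement with the
    # recorded sub-answer: a self-consistency score in [0, 1].
    resamples = [llm(sub_q) for _ in range(k)]
    return Counter(resamples).get(sub_a, 0) / k

def socratic_self_refine(pairs, llm, max_iters=3, threshold=0.8, k=5):
    # `pairs` is the decomposed reasoning chain as (sub-question, sub-answer)
    # tuples; SSR itself obtains this decomposition by prompting the LLM.
    pairs = list(pairs)
    for _ in range(max_iters):
        confs = [step_confidence(q, a, llm, k) for q, a in pairs]
        if min(confs) >= threshold:
            break  # every step is reliable enough; stop refining
        # Pinpoint the least-confident step and re-solve only that step.
        worst = min(range(len(pairs)), key=confs.__getitem__)
        sub_q, _ = pairs[worst]
        pairs[worst] = (sub_q, llm(sub_q))
    return pairs

# Usage: the second step records a wrong sub-answer; SSR detects and fixes it.
chain = [("What is 2+3?", "5"), ("What is 5*4?", "21")]
refined = socratic_self_refine(chain, toy_llm)
# refined[1] is now ("What is 5*4?", "20")
```

In the full framework, decomposition, re-solving, and the final re-stitching of the chain are all LLM calls, which is what keeps the method black-box: only model outputs, never internals, are inspected.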