🤖 AI Summary
To address the exploration-exploitation dilemma faced by large language models (LLMs) in iterative self-correction, this paper proposes a tree-guided adaptive policy optimization framework. Methodologically, it integrates GRPO—a reinforcement learning algorithm—with Thompson sampling–based tree search: refinement paths are explicitly modeled as a tree, and Thompson sampling dynamically balances exploration of promising yet uncertain paths against exploitation of empirically successful ones, enabling online policy adaptation. The framework substantially strengthens LLMs' fine-grained error correction on complex reasoning tasks. On the HumanEval, MBPP, and APPS benchmarks, it achieves up to a 4.2-percentage-point improvement in pass@1 (on MBPP) and a 12.51-percentage-point gain in pass@10 (on APPS) over a competitive GRPO baseline. Its core contribution is introducing a Bayesian-inspired tree search mechanism into LLM iterative refinement, yielding an interpretable, traceable, and efficiently convergent adaptive optimization process.
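The Thompson-sampling node selection described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the node structure, Beta(1, 1) prior, and toy success probabilities are all assumptions made for the example. Each refinement state keeps success/failure counts, a Beta posterior is sampled per node, and the node with the highest draw is expanded next.

```python
import random

class RefinementNode:
    """A node in the refinement tree, tracking outcomes of refinements
    tried from this state. (Hypothetical structure for illustration;
    the paper's actual node statistics may differ.)"""
    def __init__(self, name):
        self.name = name
        self.successes = 0  # refinements from here that passed the tests
        self.failures = 0   # refinements from here that still failed

    def sample_posterior(self, rng):
        # Thompson sampling: draw from the Beta(successes+1, failures+1)
        # posterior over this node's unknown success probability.
        return rng.betavariate(self.successes + 1, self.failures + 1)

def select_node(frontier, rng):
    """Pick the frontier node with the highest sampled success probability:
    uncertain nodes sometimes win (exploration), consistently good nodes
    usually win (exploitation)."""
    return max(frontier, key=lambda n: n.sample_posterior(rng))

# Toy simulation: two candidate refinement paths with hidden (assumed)
# probabilities of eventually yielding a passing program.
rng = random.Random(0)
true_p = {"path_a": 0.7, "path_b": 0.3}
frontier = [RefinementNode("path_a"), RefinementNode("path_b")]
pulls = {"path_a": 0, "path_b": 0}
for _ in range(500):
    node = select_node(frontier, rng)
    pulls[node.name] += 1
    if rng.random() < true_p[node.name]:
        node.successes += 1
    else:
        node.failures += 1
print(pulls)  # selection should concentrate on the higher-payoff path
```

Over repeated iterations the sampler allocates most expansions to the path that empirically succeeds more often, while still occasionally probing the weaker path—the adaptive behavior the framework relies on.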
📝 Abstract
Iterative refinement has emerged as a promising paradigm for enabling large language models (LLMs) to solve difficult reasoning and problem-solving tasks. A key challenge, however, is how to search effectively through the enormous space of possible refinements. Existing methods typically fall back on predefined heuristics, which suffer from the exploration-exploitation dilemma and cannot adapt based on past refinement outcomes. We introduce Tree-Guided Policy Refinement (TGPR), a novel framework that combines GRPO with Thompson-sampling-based tree search. TGPR actively explores both failed and successful refinement paths, yielding denser training trajectories and more adaptive policies. On the HumanEval, MBPP, and APPS benchmarks, our method achieves up to +4.2 percentage points absolute improvement in pass@1 (on MBPP) and up to +12.51 percentage points absolute improvement in pass@10 (on APPS) compared to a competitive GRPO baseline. Beyond debugging code, TGPR represents a principled approach to combining learned policies with structured search, offering a general framework for enhancing iterative refinement and stateful reasoning in LLMs.