🤖 AI Summary
To address the exploration-exploitation dilemma faced by large language models (LLMs) in iterative self-correction, this paper proposes a tree-guided adaptive policy optimization framework. Methodologically, it integrates GRPO—a reinforcement learning algorithm—with Thompson sampling–based tree search: refinement paths are explicitly modeled as a tree, and Thompson sampling dynamically balances exploration of promising yet uncertain paths against exploitation of empirically successful ones, enabling online policy adaptation. The framework substantially strengthens LLMs' fine-grained error correction on complex reasoning tasks. On the HumanEval, MBPP, and APPS benchmarks, it achieves up to a 4.2-percentage-point improvement in pass@1 (on MBPP) and a 12.51-percentage-point gain in pass@10 (on APPS) over a competitive GRPO baseline. Its core contribution is introducing a Bayesian-inspired tree search mechanism into LLM iterative refinement, yielding an interpretable, traceable, and efficiently convergent adaptive optimization process.
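The Thompson-sampling node selection described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the node structure, Beta(1, 1) prior, and toy success probabilities are all assumptions made for the example. Each refinement state keeps success/failure counts, a Beta posterior is sampled per node, and the node with the highest draw is expanded next.

```python
import random

class RefinementNode:
    """A node in the refinement tree, tracking outcomes of refinements
    tried from this state. (Hypothetical structure for illustration;
    the paper's actual node statistics may differ.)"""
    def __init__(self, name):
        self.name = name
        self.successes = 0  # refinements from here that passed the tests
        self.failures = 0   # refinements from here that still failed

    def sample_posterior(self, rng):
        # Thompson sampling: draw from the Beta(successes+1, failures+1)
        # posterior over this node's unknown success probability.
        return rng.betavariate(self.successes + 1, self.failures + 1)

def select_node(frontier, rng):
    """Pick the frontier node with the highest sampled success probability:
    uncertain nodes sometimes win (exploration), consistently good nodes
    usually win (exploitation)."""
    return max(frontier, key=lambda n: n.sample_posterior(rng))

# Toy simulation: two candidate refinement paths with hidden (assumed)
# probabilities of eventually yielding a passing program.
rng = random.Random(0)
true_p = {"path_a": 0.7, "path_b": 0.3}
frontier = [RefinementNode("path_a"), RefinementNode("path_b")]
pulls = {"path_a": 0, "path_b": 0}
for _ in range(500):
    node = select_node(frontier, rng)
    pulls[node.name] += 1
    if rng.random() < true_p[node.name]:
        node.successes += 1
    else:
        node.failures += 1
print(pulls)  # selection should concentrate on the higher-payoff path
```

Over repeated iterations the sampler allocates most expansions to the path that empirically succeeds more often, while still occasionally probing the weaker path—the adaptive behavior the framework relies on.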
📝 Abstract
Iterative refinement has emerged as a promising paradigm for enabling large language models (LLMs) to solve difficult reasoning and problem-solving tasks. A key challenge, however, is how to search effectively through the enormous space of possible refinements. Existing methods typically fall back on predefined heuristics, which suffer from the exploration-exploitation dilemma and cannot adapt based on past refinement outcomes. We introduce Tree-Guided Policy Refinement (TGPR), a novel framework that combines GRPO with Thompson-sampling-based tree search. TGPR actively explores both failed and successful refinement paths, yielding denser training trajectories and more adaptive policies. On the HumanEval, MBPP, and APPS benchmarks, our method achieves up to +4.2 percentage points absolute improvement in pass@1 (on MBPP) and up to +12.51 percentage points absolute improvement in pass@10 (on APPS) compared to a competitive GRPO baseline. Beyond debugging code, TGPR represents a principled approach to combining learned policies with structured search, offering a general framework for enhancing iterative refinement and stateful reasoning in LLMs.