🤖 AI Summary
Open-source small-scale LLMs (<100B parameters) underperform on repository-level issue-resolution tasks due to low-quality chain-of-thought (CoT) data; existing approaches rely on weak rejection sampling and lack rigorous validation of intermediate reasoning steps, leading to error accumulation. Method: The paper proposes MCTS-REFINE, the first reflective Monte Carlo Tree Search (MCTS)-based algorithm for synthesizing issue-resolution CoT data. It decomposes issue resolution into three sequential stages (file localization, fault localization, and patch generation) and enforces an exact match between each stage's intermediate output and the ground-truth developer patch, combining structured CoT synthesis, strict exact-match rejection sampling, and multi-stage verification. Contribution/Results: Fine-tuned on the resulting dataset, Qwen2.5-72B-Instruct achieves 28.3% and 35.0% resolution rates on SWE-bench Lite and SWE-bench Verified, respectively, substantially surpassing prior state-of-the-art models of comparable scale.
📝 Abstract
LLMs demonstrate strong performance in automated software engineering, particularly for code generation and issue resolution. While proprietary models like GPT-4o achieve high benchmark scores on SWE-bench, their API dependence, cost, and privacy concerns limit adoption. Open-source alternatives offer transparency but underperform on complex tasks, especially models below 100B parameters. Although high-quality Chain-of-Thought (CoT) data can enhance reasoning, current methods face two critical flaws: (1) weak rejection sampling reduces data quality, and (2) inadequate validation of intermediate steps causes error accumulation. These limitations produce flawed reasoning chains that impair LLMs' ability to learn reliable issue resolution. The paper proposes MCTS-REFINE, an enhanced Monte Carlo Tree Search (MCTS)-based algorithm that dynamically validates and optimizes intermediate reasoning steps through a rigorous rejection sampling strategy, generating high-quality CoT data to improve LLM performance on issue-resolution tasks. Key innovations include: (1) augmenting MCTS with a reflection mechanism that corrects errors via rejection sampling and refinement, (2) decomposing issue resolution into three subtasks (File Localization, Fault Localization, and Patch Generation), each with clear ground-truth criteria, and (3) enforcing a strict sampling protocol in which intermediate outputs must exactly match verified developer patches, ensuring correctness across reasoning paths. Experiments on SWE-bench Lite and SWE-bench Verified demonstrate that LLMs fine-tuned on this CoT dataset achieve substantial improvements over baselines. Notably, Qwen2.5-72B-Instruct achieves 28.3% (Lite) and 35.0% (Verified) resolution rates, surpassing the SOTA baseline SWE-Fixer-Qwen-72B at the same parameter scale, which reached only 24.7% (Lite) and 32.8% (Verified).
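The strict sampling protocol can be sketched as a per-stage exact-match filter over candidate reasoning steps. The following is a minimal illustration only; the function names, stage keys, and normalization are hypothetical assumptions, not the paper's implementation:

```python
# Hedged sketch of exact-match rejection sampling across the three
# subtasks. A candidate CoT trajectory is accepted only if every
# stage's intermediate output exactly matches the ground truth
# derived from the verified developer patch. All names hypothetical.

STAGES = ["file_localization", "fault_localization", "patch_generation"]

def normalize(text: str) -> str:
    """Canonicalize whitespace so only substantive differences count."""
    return "\n".join(line.rstrip() for line in text.strip().splitlines())

def accept_step(stage: str, candidate_output: str, ground_truth: dict) -> bool:
    """Rejection-sampling filter for one stage: exact match required."""
    return normalize(candidate_output) == normalize(ground_truth[stage])

def validate_trajectory(steps: dict, ground_truth: dict) -> bool:
    """Keep a trajectory only if every stage passes in sequence;
    a single mismatch rejects the whole chain, preventing error
    accumulation in the synthesized CoT data."""
    return all(accept_step(s, steps[s], ground_truth) for s in STAGES)
```

Under this protocol, a trajectory that localizes the wrong file or line is discarded even if its final patch happens to be correct, which is what distinguishes strict per-stage validation from outcome-only rejection sampling.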