🤖 AI Summary
Open-source small-scale LLMs (<100B parameters) underperform on repository-level issue-resolution tasks due to low-quality chain-of-thought (CoT) data; existing approaches rely on weak rejection sampling and lack rigorous validation of intermediate reasoning steps, leading to error accumulation. Method: The paper proposes MCTS-REFINE, the first reflective Monte Carlo Tree Search (MCTS)-based algorithm for synthesizing issue-resolution CoT data. It decomposes issue resolution into three sequential stages (file localization, fault localization, and patch generation) and enforces an exact match between each stage's intermediate output and the ground-truth developer patch, combining structured CoT synthesis, strict exact-match rejection sampling, and multi-stage verification. Contribution/Results: Fine-tuned on the resulting dataset, Qwen2.5-72B-Instruct achieves 28.3% and 35.0% resolution rates on SWE-bench Lite and SWE-bench Verified, respectively, substantially surpassing prior state-of-the-art models of comparable scale.
📝 Abstract
LLMs demonstrate strong performance in automated software engineering, particularly for code generation and issue resolution. While proprietary models like GPT-4o achieve high benchmark scores on SWE-bench, their API dependence, cost, and privacy concerns limit adoption. Open-source alternatives offer transparency but underperform on complex tasks, especially models below 100B parameters. Although high-quality Chain-of-Thought (CoT) data can enhance reasoning, current methods face two critical flaws: (1) weak rejection sampling reduces data quality, and (2) inadequate validation of intermediate steps causes error accumulation. These limitations produce flawed reasoning chains that impair LLMs' ability to learn reliable issue resolution. The paper proposes MCTS-REFINE, an enhanced Monte Carlo Tree Search (MCTS)-based algorithm that dynamically validates and optimizes intermediate reasoning steps through a rigorous rejection sampling strategy, generating high-quality CoT data to improve LLM performance on issue-resolution tasks. Key innovations include: (1) augmenting MCTS with a reflection mechanism that corrects errors via rejection sampling and refinement, (2) decomposing issue resolution into three subtasks (File Localization, Fault Localization, and Patch Generation), each with clear ground-truth criteria, and (3) enforcing a strict sampling protocol in which intermediate outputs must exactly match verified developer patches, ensuring correctness across reasoning paths. Experiments on SWE-bench Lite and SWE-bench Verified demonstrate that LLMs fine-tuned on this CoT dataset achieve substantial improvements over baselines. Notably, Qwen2.5-72B-Instruct achieves 28.3% (Lite) and 35.0% (Verified) resolution rates, surpassing the SOTA baseline SWE-Fixer-Qwen-72B at the same parameter scale, which reached only 24.7% (Lite) and 32.8% (Verified).
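The strict sampling protocol can be sketched as a per-stage exact-match filter over candidate reasoning steps. The following is a minimal illustration only; the function names, stage keys, and normalization are hypothetical assumptions, not the paper's implementation:

```python
# Hedged sketch of exact-match rejection sampling across the three
# subtasks. A candidate CoT trajectory is accepted only if every
# stage's intermediate output exactly matches the ground truth
# derived from the verified developer patch. All names hypothetical.

STAGES = ["file_localization", "fault_localization", "patch_generation"]

def normalize(text: str) -> str:
    """Canonicalize whitespace so only substantive differences count."""
    return "\n".join(line.rstrip() for line in text.strip().splitlines())

def accept_step(stage: str, candidate_output: str, ground_truth: dict) -> bool:
    """Rejection-sampling filter for one stage: exact match required."""
    return normalize(candidate_output) == normalize(ground_truth[stage])

def validate_trajectory(steps: dict, ground_truth: dict) -> bool:
    """Keep a trajectory only if every stage passes in sequence;
    a single mismatch rejects the whole chain, preventing error
    accumulation in the synthesized CoT data."""
    return all(accept_step(s, steps[s], ground_truth) for s in STAGES)
```

Under this protocol, a trajectory that localizes the wrong file or line is discarded even if its final patch happens to be correct, which is what distinguishes strict per-stage validation from outcome-only rejection sampling.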