AI Summary
Existing supervised fine-tuning approaches often inherit ineffective steps and logical flaws from teacher trajectories, and they do not directly optimize for reasoning validity or trajectory efficiency. This work proposes a bidirectional optimization framework for trajectory generation that, for the first time, converts developer-provided reference patches into latent process graphs to guide trajectory selection. In the reverse phase, it distills each reference patch into a graph of contextual facts and solution milestones; in the forward phase, it scores and prunes teacher trajectories against this graph, retaining only the shortest effective segments under a groundedness check that blocks information leakage. Using only 1.8k curated samples, the method improves Pass@1 on SWE-bench Verified by up to 10.8 percentage points and reduces inference cost by approximately 15%, with consistent gains also observed on SWE-bench Lite.
Abstract
Supervised fine-tuning (SFT) on long teacher trajectories is the dominant way to instill investigation and reasoning in open software-engineering (SWE) agents. Since every retained response becomes an imitation target, the student inherits not only the final outcome but also the intermediate flaws, including ungrounded leaps and redundant loops. High-quality training data must be effective (each step is grounded and narrows the agent's epistemic gap to the correct fix) and efficient (each step is information-bearing rather than redundant or looping). Existing recipes filter or relabel teacher rollouts using only a binary terminal verifier, which does not directly target either axis and provides no supervision on instances where the teacher fails.
Most real issues include a developer-authored reference patch, $p^\star$, revealing the file paths, runtime behaviors, and coding conventions presupposed by the correct fix, yet standard pipelines discard it. We propose Patches-to-Trajectories (P2T), which uses $p^\star$ as privileged information during curation and formulates trajectory construction as bi-objective optimization over per-step effectiveness and trajectory length. A reverse phase distills $p^\star$ into a latent process graph, $G^\star$, of contextual facts and solution milestones. A forward phase curates trajectories from blinded teacher continuations by scoring per-step progress against $G^\star$ under a leakage-blocking groundedness check and retaining the shortest effective segments.
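The forward phase described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how progress is scored or how groundedness is checked, so everything here is an assumption made for illustration. In particular, $G^\star$ is reduced to a flat set of node labels (edges are ignored), and `Step`, `claims`, `observed`, `effective_prefix`, and `shortest_effective_segment` are hypothetical names. A step gets credit for a graph node only if that node already appears in the tool observations gathered so far; a candidate that asserts a graph node without such evidence is treated as leakage and dropped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One teacher step (hypothetical structure, for illustration only)."""
    claims: frozenset    # graph nodes the step's reasoning asserts
    observed: frozenset  # graph nodes visible in the step's tool output


def effective_prefix(steps, graph_nodes, known, evidence):
    """Return (prefix, gained) for the shortest prefix that reaches a new node.

    Returns None if the candidate asserts a graph node without supporting
    evidence (treated as leakage) or never reaches a new node.
    """
    seen = set(evidence)
    for i, step in enumerate(steps):
        seen |= step.observed
        targeted = step.claims & graph_nodes
        if targeted - seen:
            return None  # ungrounded claim: fails the groundedness check
        gained = targeted - known
        if gained:
            return steps[: i + 1], gained
    return None


def shortest_effective_segment(candidates, graph_nodes,
                               known=frozenset(), evidence=frozenset()):
    """Pick the shortest grounded prefix; break ties by nodes gained."""
    best = None
    for steps in candidates:
        result = effective_prefix(steps, graph_nodes, known, evidence)
        if result is None:
            continue
        prefix, gained = result
        key = (len(prefix), -len(gained))
        if best is None or key < best[0]:
            best = (key, prefix, gained)
    return None if best is None else (best[1], best[2])


# Hypothetical graph nodes and three candidate continuations.
G = frozenset({"file:utils.py", "behavior:raises_on_none", "milestone:add_guard"})
a = [Step(frozenset(), frozenset({"file:utils.py"})),
     Step(frozenset(), frozenset()),                    # redundant step
     Step(frozenset({"file:utils.py"}), frozenset())]
b = [Step(frozenset({"file:utils.py"}), frozenset({"file:utils.py"}))]
c = [Step(frozenset({"milestone:add_guard"}), frozenset())]  # no evidence

segment, gained = shortest_effective_segment([a, b, c], G)
print(len(segment), sorted(gained))
```

In this toy setup, candidate `b` is selected because it reaches a graph node in one grounded step, candidate `a` reaches the same node only after a redundant step, and candidate `c` is rejected because it asserts a milestone it never observed. A fuller version would presumably repeat this selection from the updated state until all milestones are covered, and would use the graph's structure rather than a flat set.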
Using only 1.8k curated SWE-Gym instances, P2T improves effectiveness and efficiency over outcome-filtered SFT and its tool-error-masking variant. On SWE-bench Verified, it raises Pass@1 by up to 10.8 points while reducing per-instance inference cost by ~15%, with consistent gains on SWE-bench Lite. Size-matched ablations and qualitative analysis further isolate trajectory quality from data scale.