🤖 AI Summary
This work addresses a key limitation of conventional rejection sampling in mathematical reasoning fine-tuning: its exclusive retention of correct reasoning trajectories discards valuable learning signals embedded in erroneous attempts, thereby failing to model the trial-and-error nature of human-like reasoning. To overcome this, the authors propose TrajFusion, a method that reframes rejection sampling as a structured supervised data construction process. TrajFusion interleaves incorrect trajectories, reflective prompts, and correct solutions to synthesize fused training samples, with sample length adaptively controlled based on error frequency and diversity. Notably, it is the first approach to explicitly incorporate both error trajectories and reflection mechanisms into the supervision signal, effectively modeling trial-and-error reasoning without altering model architecture or training objectives, while naturally degenerating to standard rejection sampling when errors carry no informative content. Experiments demonstrate that TrajFusion significantly outperforms traditional methods across multiple mathematical reasoning benchmarks, particularly excelling in complex, long-chain reasoning tasks.
📝 Abstract
Large language models (LLMs) have made impressive strides in mathematical reasoning, often fine-tuned via rejection sampling, which retains only correct reasoning trajectories. While effective, this paradigm treats supervision as a binary filter that systematically excludes teacher-generated errors, leaving a gap in how reasoning failures are modeled during training. In this paper, we propose TrajFusion, a fine-tuning strategy that reframes rejection sampling as a structured supervision construction process. Specifically, TrajFusion forms fused trajectories that explicitly model trial-and-error reasoning by interleaving selected incorrect trajectories with reflection prompts and correct trajectories. The length of each fused sample is adaptively controlled based on the frequency and diversity of teacher errors, providing richer supervision for challenging problems while safely reducing to vanilla rejection sampling fine-tuning (RFT) when error signals are uninformative. TrajFusion requires no changes to the model architecture or training objective. Extensive experiments across multiple math benchmarks demonstrate that TrajFusion consistently outperforms RFT, particularly on challenging and long-form reasoning problems.
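To make the data-construction step concrete, here is a minimal sketch of how a fused trajectory might be assembled, assuming teacher samples are already labeled correct or incorrect. All names (`build_fused_sample`, `REFLECTION`) are hypothetical, and the adaptive-length rule shown is a simple illustrative stand-in for the paper's actual control based on error frequency and diversity, not its exact formula:

```python
import random

# Hypothetical reflection prompt inserted between an error and the retry.
REFLECTION = "Wait, this approach is wrong. Let me reconsider."

def build_fused_sample(question, incorrect, correct, max_errors=3):
    """Interleave incorrect attempts with reflection prompts, ending with a
    correct trajectory. The number of retained errors grows with how often
    and how diversely the teacher fails (an illustrative stand-in for the
    paper's adaptive length control)."""
    if not correct:
        return None  # no supervised sample without a correct trajectory
    # Diversity proxy: deduplicate wrong attempts, preserving order.
    distinct_errors = list(dict.fromkeys(incorrect))
    # Error-frequency proxy: fraction of teacher samples that were wrong.
    total = len(incorrect) + len(correct)
    error_rate = len(incorrect) / total
    # Adaptive budget: harder problems (higher error rate, more diverse
    # errors) keep more erroneous prefixes; easy ones keep none.
    k = min(max_errors, len(distinct_errors), round(error_rate * max_errors))
    parts = [question]
    for err in random.sample(distinct_errors, k):
        parts.append(err)
        parts.append(REFLECTION)
    parts.append(random.choice(correct))
    return "\n".join(parts)
```

Note that when the error budget `k` is zero (no errors, or uninformative ones), the sample degenerates to a plain question–answer pair, mirroring the reduction to vanilla RFT described above.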