🤖 AI Summary
Large reasoning models (LRMs) suffer from the “prefix-dominance trap,” in which a poor initial reasoning prefix severely impedes subsequent self-correction. To address this, we propose Learning from Peers (LeaP), a multi-path collaborative reasoning mechanism in which parallel reasoning paths periodically summarize their intermediate reasoning, share these summaries through a routing mechanism, and integrate peer insights during inference. We further introduce LeaP-T, a fine-tuned model series that improves smaller models’ adherence to summarization and reflection instructions. Experiments show that QwQ-32B with LeaP gains nearly 5 points on average over the baseline on mathematical benchmarks and surpasses DeepSeek-R1-671B on three of them, while LeaP-T-7B matches the performance of a 14B distilled model and exhibits faster error correction and greater robustness.
📝 Abstract
Large Reasoning Models (LRMs) have the ability to self-correct even when they make mistakes in their reasoning paths. However, our study reveals that when the reasoning process starts with a short but poor beginning, it becomes difficult for the model to recover. We refer to this phenomenon as the "Prefix Dominance Trap". Inspired by psychological findings that peer interaction can promote self-correction without negatively impacting already accurate individuals, we propose **Learning from Peers** (LeaP) to address this phenomenon. Specifically, every T tokens, each reasoning path summarizes its intermediate reasoning and shares it with others through a routing mechanism, enabling paths to incorporate peer insights during inference. However, we observe that smaller models sometimes fail to follow summarization and reflection instructions effectively. To address this, we fine-tune them into our **LeaP-T** model series. Experiments on AIME 2024, AIME 2025, AIMO 2025, and GPQA Diamond show that LeaP provides substantial improvements. For instance, QwQ-32B with LeaP achieves nearly 5 absolute points higher than the baseline on average, and surpasses DeepSeek-R1-671B on three math benchmarks with an average gain of 3.3 points. Notably, our fine-tuned LeaP-T-7B matches the performance of DeepSeek-R1-Distill-Qwen-14B on AIME 2024. In-depth analysis reveals that LeaP corrects errors robustly through timely peer insights, showing strong error tolerance and handling varied task difficulty. LeaP marks a milestone by enabling LRMs to collaborate during reasoning. Our code, datasets, and models are available at https://learning-from-peers.github.io/.
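The mechanism described above — periodic summarization of each reasoning path, routing of summaries to peers, and continued generation conditioned on peer insights — can be sketched as a simple control loop. This is only an illustrative skeleton, not the authors' implementation: `summarize`, `route`, and `generate_chunk` are hypothetical placeholders standing in for a real model's summarization, routing, and decoding, and decoding "steps" stand in for token counts.

```python
# Illustrative sketch of the LeaP multi-path loop. All function names and the
# toy token representation are assumptions for demonstration purposes only.

def summarize(path_tokens):
    """Placeholder: condense a path's recent reasoning into a short summary."""
    return path_tokens[-3:]  # keep the last few "tokens" as a stand-in summary

def route(summaries, receiver_idx, top_k=2):
    """Placeholder routing: each path receives up to top_k peer summaries."""
    peers = [s for i, s in enumerate(summaries) if i != receiver_idx]
    return peers[:top_k]

def generate_chunk(path_tokens, peer_summaries, step):
    """Placeholder decoding: extend the path, conditioned on peer insights."""
    insights = [tok for summary in peer_summaries for tok in summary]
    return path_tokens + [f"step{step}"] + insights

def leap_inference(num_paths=3, interval=2, total_steps=6):
    """Run parallel paths; every `interval` steps, exchange summaries."""
    paths = [[f"p{i}-start"] for i in range(num_paths)]
    for step in range(1, total_steps + 1):
        if step % interval == 0:
            # Periodic peer exchange (the paper's "every T tokens").
            summaries = [summarize(p) for p in paths]
            paths = [generate_chunk(p, route(summaries, i), step)
                     for i, p in enumerate(paths)]
        else:
            # Ordinary decoding with no peer input.
            paths = [generate_chunk(p, [], step) for p in paths]
    return paths
```

In a real system, `generate_chunk` would be a call to the LRM's decoder with peer summaries injected into the context, and `route` would select which peers' summaries each path sees; the loop structure, however, mirrors the description in the abstract.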