🤖 AI Summary
Large reasoning models (LRMs) suffer from the “prefix-dominance trap,” in which a poor initial reasoning prefix severely impedes subsequent self-correction. To address this, we propose Learning from Peers (LeaP), a multi-path collaborative reasoning mechanism in which parallel reasoning paths periodically summarize their intermediate reasoning, share these summaries through a routing mechanism, and integrate peer insights during inference. We further introduce LeaP-T, a fine-tuned model series that improves smaller models’ adherence to summarization and reflection instructions. Experiments show that QwQ-32B with LeaP gains nearly 5 points on average over the baseline on mathematical benchmarks and surpasses DeepSeek-R1-671B on three of them, while LeaP-T-7B matches the performance of a 14B distilled model and exhibits faster error correction and greater robustness.
📝 Abstract
Large Reasoning Models (LRMs) have the ability to self-correct even when they make mistakes in their reasoning paths. However, our study reveals that when the reasoning process starts with a short but poor beginning, it becomes difficult for the model to recover. We refer to this phenomenon as the "Prefix Dominance Trap". Inspired by psychological findings that peer interaction can promote self-correction without negatively impacting already accurate individuals, we propose **Learning from Peers** (LeaP) to address this phenomenon. Specifically, every T tokens, each reasoning path summarizes its intermediate reasoning and shares it with others through a routing mechanism, enabling paths to incorporate peer insights during inference. However, we observe that smaller models sometimes fail to follow summarization and reflection instructions effectively. To address this, we fine-tune them into our **LeaP-T** model series. Experiments on AIME 2024, AIME 2025, AIMO 2025, and GPQA Diamond show that LeaP provides substantial improvements. For instance, QwQ-32B with LeaP achieves nearly 5 absolute points higher than the baseline on average, and surpasses DeepSeek-R1-671B on three math benchmarks with an average gain of 3.3 points. Notably, our fine-tuned LeaP-T-7B matches the performance of DeepSeek-R1-Distill-Qwen-14B on AIME 2024. In-depth analysis reveals that LeaP corrects errors robustly through timely peer insights, showing strong error tolerance and handling varied task difficulty. LeaP marks a milestone by enabling LRMs to collaborate during reasoning. Our code, datasets, and models are available at https://learning-from-peers.github.io/.
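The mechanism described above — periodic summarization of each reasoning path, routing of summaries to peers, and continued generation conditioned on peer insights — can be sketched as a simple control loop. This is only an illustrative skeleton, not the authors' implementation: `summarize`, `route`, and `generate_chunk` are hypothetical placeholders standing in for a real model's summarization, routing, and decoding, and decoding "steps" stand in for token counts.

```python
# Illustrative sketch of the LeaP multi-path loop. All function names and the
# toy token representation are assumptions for demonstration purposes only.

def summarize(path_tokens):
    """Placeholder: condense a path's recent reasoning into a short summary."""
    return path_tokens[-3:]  # keep the last few "tokens" as a stand-in summary

def route(summaries, receiver_idx, top_k=2):
    """Placeholder routing: each path receives up to top_k peer summaries."""
    peers = [s for i, s in enumerate(summaries) if i != receiver_idx]
    return peers[:top_k]

def generate_chunk(path_tokens, peer_summaries, step):
    """Placeholder decoding: extend the path, conditioned on peer insights."""
    insights = [tok for summary in peer_summaries for tok in summary]
    return path_tokens + [f"step{step}"] + insights

def leap_inference(num_paths=3, interval=2, total_steps=6):
    """Run parallel paths; every `interval` steps, exchange summaries."""
    paths = [[f"p{i}-start"] for i in range(num_paths)]
    for step in range(1, total_steps + 1):
        if step % interval == 0:
            # Periodic peer exchange (the paper's "every T tokens").
            summaries = [summarize(p) for p in paths]
            paths = [generate_chunk(p, route(summaries, i), step)
                     for i, p in enumerate(paths)]
        else:
            # Ordinary decoding with no peer input.
            paths = [generate_chunk(p, [], step) for p in paths]
    return paths
```

In a real system, `generate_chunk` would be a call to the LRM's decoder with peer summaries injected into the context, and `route` would select which peers' summaries each path sees; the loop structure, however, mirrors the description in the abstract.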