CORE: Collaborative Reasoning via Cross Teaching

📅 2026-01-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the instability of large language models on complex reasoning tasks, where different models make complementary errors on the same instances. To exploit this, the authors propose CORE, a collaborative reasoning framework that, for the first time, introduces inter-model collaboration during training. CORE employs a two-stage process—independent sampling followed by a contextual rescue round—augmented with a determinantal point process (DPP)-inspired diversity regularizer and an explicit rescue reward, all jointly optimized via reinforcement learning to improve correctness, diversity, and rescue capability. Using only 1,000 training samples, a 3B+4B model ensemble achieves Pass@2 scores of 99.54% on GSM8K and 92.08% on MATH, substantially outperforming individual models. The approach also delivers strong gains on challenging benchmarks such as GPQA and AIME, significantly boosting reasoning capabilities without increasing model scale.

📝 Abstract
Large language models exhibit complementary reasoning errors: on the same instance, one model may succeed with a particular decomposition while another fails. We propose Collaborative Reasoning (CORE), a training-time collaboration framework that converts peer success into a learning signal via a cross-teaching protocol. Each problem is solved in two stages: a cold round of independent sampling, followed by a contextual rescue round in which models that failed receive a hint extracted from a successful peer. CORE optimizes a combined reward that balances (i) correctness, (ii) a lightweight DPP-inspired diversity term to reduce error overlap, and (iii) an explicit rescue bonus for successful recovery. We evaluate CORE across four standard reasoning datasets: GSM8K, MATH, AIME, and GPQA. With only 1,000 training examples, a pair of small open-source models (3B+4B) reaches Pass@2 of 99.54% on GSM8K and 92.08% on MATH, compared to 82.50% and 74.82% for single-model training. On harder datasets, the 3B+4B pair reaches Pass@2 of 77.34% on GPQA (trained on 348 examples) and 79.65% on AIME (trained on 792 examples), using a training-time budget of at most 1536 context tokens and 3072 generated tokens. Overall, these results show that training-time collaboration can reliably convert model complementarity into large gains without scaling model size.
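The combined reward described above (correctness, a DPP-inspired diversity term over sampled solutions, and a rescue bonus) could be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the weights, the embedding-based Gram kernel, and all function names are assumptions.

```python
import numpy as np

def dpp_diversity(embeddings, eps=1e-6):
    """Log-determinant of a Gram (cosine-similarity) kernel over
    L2-normalized solution embeddings; larger means more diverse samples.
    (Illustrative proxy for the paper's DPP-inspired term.)"""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    K = X @ X.T
    # Small jitter keeps the kernel positive definite when samples
    # are near-duplicates (which drives the determinant toward zero).
    _, logdet = np.linalg.slogdet(K + eps * np.eye(len(K)))
    return logdet

def combined_reward(correct, embeddings, rescued,
                    w_div=0.1, w_rescue=0.5):
    """Hypothetical combination: correctness + weighted diversity
    + an explicit bonus when a failed model recovers in the rescue round.
    The weights w_div and w_rescue are made-up values."""
    r_correct = float(np.mean(correct))   # fraction of correct samples
    r_div = dpp_diversity(embeddings)     # penalizes error/solution overlap
    r_rescue = w_rescue * float(rescued)  # 1 if a peer-hinted retry succeeded
    return r_correct + w_div * r_div + r_rescue
```

Under this sketch, a batch of near-identical solutions scores a strongly negative diversity term, while distinct decompositions score close to zero, so the regularizer pushes the ensemble away from overlapping failure modes.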
Problem

Research questions and friction points this paper is trying to address.

complementary reasoning errors
training-time collaboration
cross-teaching
reasoning performance
model complementarity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Collaborative Reasoning
Cross Teaching
Complementary Errors
DPP-inspired Diversity
Rescue-based Training
Kshitij Mishra
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Mirat Aubakirov
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Martin Takac
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Nils Lukas
MBZUAI
ML Security, AI Safety, Privacy-preserving ML
Salem Lahlou
MBZUAI
probabilistic modeling, uncertainty estimation, GFlowNets, LLM reasoning