Peer-Predictive Self-Training for Language Model Reasoning

📅 2026-04-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates how language models can continue to improve without external supervision. The authors propose Peer-Predictive Self-Training (PST), a self-training framework that uses neither labeled data nor a teacher-student architecture: multiple models generate responses to a prompt in sequence, and the aggregated final answer serves as an internal training signal. A pointwise mutual information (PMI) score measures how informative each intermediate response is about the aggregate and scales that response's self-training update, so that responses already aligned with the aggregate are updated less and misaligned ones are updated more. On the mathematical reasoning benchmarks SimulEq, Math500, and MultiArith, the method improves exact-match accuracy by 2.2 to 4.3 percentage points across Gemma-2-2B, LLaMA-3.2-1B, and Qwen-2.5-1.5B, and reduces the average generator-verifier gap (GV-Gap) by 26 to 40 percent. These results suggest that purely cross-model interaction can drive label-free self-improvement.

📝 Abstract

Mechanisms for continued self-improvement of language models without external supervision remain an open challenge. We propose Peer-Predictive Self-Training (PST), a label-free fine-tuning framework in which multiple language models improve collaboratively by leveraging a cross-model aggregated response as an internal training signal. Given a prompt question, the models generate responses sequentially; the final aggregated answer, often more reliable than individual responses in practice, serves as an internal target for learning. We measure how informative each intermediate response is about the aggregate using pointwise mutual information (PMI), and use this signal to scale self-training updates. Responses already aligned with the aggregate are updated less, while less informative or misaligned responses are updated more. On mathematical reasoning benchmarks (SimulEq, Math500, and MultiArith), PST improves exact-match accuracy by 2.2 to 4.3 percentage points across Gemma-2-2B, LLaMA-3.2-1B, and Qwen-2.5-1.5B, and reduces the average generator-verifier gap (GV-Gap) by 26 to 40 percent, while requiring no external supervision or teacher-student hierarchy and relying solely on cross-model interactions. These results suggest that cross-model generations and peer-predictive feedback can serve as an effective approach for self-supervised training.

Problem

Research questions and friction points this paper is trying to address.

self-improvement

language models

reasoning

self-training

unsupervised learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Peer-Predictive Self-Training

self-supervised learning

cross-model aggregation