Truthfulness Despite Weak Supervision: Evaluating and Training LLMs Using Peer Prediction

📅 2026-01-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses dishonest outputs from large language models (LLMs) in weakly supervised settings where ground-truth labels are unavailable and models may exploit deceptive strategies to manipulate evaluation and training. To counter this, the study introduces peer prediction, a mechanism-design technique, into LLM evaluation and post-training, leveraging the mutual predictability among model responses to construct an unsupervised reward signal that incentivizes honest and informative answers. Notably, the method exhibits an "inverse scaling" property: its robustness to deception increases as the capability gap between the evaluator and the evaluated model widens, overcoming the limitation of conventional LLM-as-a-Judge approaches, which fail when assessing stronger models. Experiments show that reward from a non-finetuned 0.135B-parameter model reliably restores truthfulness in a maliciously fine-tuned 8B model, and that the method enables accurate evaluation across capability gaps exceeding two orders of magnitude, a regime in which LLM-as-a-Judge falls below random guessing.

📝 Abstract
The evaluation and post-training of large language models (LLMs) rely on supervision, but strong supervision for difficult tasks is often unavailable, especially when evaluating frontier models. In such cases, models have been shown to exploit evaluations built on imperfect supervision, leading to deceptive results. However, a wealth of mechanism design research, so far underutilized in LLM research, focuses on game-theoretic incentive compatibility, i.e., eliciting honest and informative answers with weak supervision. Drawing from this literature, we introduce the peer prediction method for model evaluation and post-training. It rewards honest and informative answers over deceptive and uninformative ones, using a metric based on mutual predictability and without requiring ground truth labels. We demonstrate the method's effectiveness and resistance to deception, with both theoretical guarantees and empirical validation on models with up to 405B parameters. We show that training an 8B model with peer prediction-based reward recovers most of the drop in truthfulness caused by prior malicious finetuning, even when the reward is produced by a 0.135B language model with no finetuning. On the evaluation front, in contrast to LLM-as-a-Judge, which requires strong and trusted judges, we discover an inverse scaling property in peer prediction: surprisingly, resistance to deception strengthens as the capability gap between the experts and participants widens, enabling reliable evaluation of strong models with weak supervision. In particular, LLM-as-a-Judge becomes worse than random guessing when facing deceptive models 5-20x the judge's size, while peer prediction thrives when such gaps are large, including in cases with over 100x size difference.
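To make the "mutual predictability" idea concrete, here is a hedged toy sketch, not the paper's exact metric: a mutual-information-style peer prediction score on binary answers, where a latent truth is assumed, "honest" reporters correlate with it, and "uninformative" reporters answer at random. Reporters whose answers carry information about the truth become mutually predictable and score above zero with no ground-truth label ever consulted; the distributions and the 0.9 accuracy parameter are illustrative assumptions, not from the paper.

```python
import math

# Toy setup (assumed for illustration): a question has a latent truth
# t in {0, 1}, uniform. An honest reporter outputs t with probability 0.9;
# an uninformative reporter outputs 0 or 1 uniformly, ignoring t.

def report_dist(honest, t):
    """P(report = 1) given the latent truth t."""
    return (0.9 if t == 1 else 0.1) if honest else 0.5

def prob(honest, a, t):
    """P(report = a) given the latent truth t."""
    p1 = report_dist(honest, t)
    return p1 if a == 1 else 1 - p1

def joint(a, b, honest_a, honest_b):
    """P(report_a = a, report_b = b), marginalizing over the latent truth."""
    return sum(0.5 * prob(honest_a, a, t) * prob(honest_b, b, t) for t in (0, 1))

def marginal(a, honest):
    """P(report = a), marginalizing over the latent truth."""
    return sum(0.5 * prob(honest, a, t) for t in (0, 1))

def peer_score(honest_a, honest_b):
    """Mutual information between two reporters' answers:
    E[log p(a, b) / (p(a) p(b))]. No ground-truth label is used."""
    return sum(
        joint(a, b, honest_a, honest_b)
        * math.log(joint(a, b, honest_a, honest_b)
                   / (marginal(a, honest_a) * marginal(b, honest_b)))
        for a in (0, 1) for b in (0, 1)
    )

print(peer_score(True, True))    # two honest reporters: clearly positive
print(peer_score(True, False))   # honest vs. uninformative: ~0 (independent)
```

Two honest reporters are correlated through the truth, so their score is strictly positive, while an uninformative reporter is independent of everyone and scores about zero; this is the basic incentive that makes honest reporting the rewarded strategy under weak supervision.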
Problem

Research questions and friction points this paper is trying to address.

truthfulness
weak supervision
LLM evaluation
deception
peer prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

peer prediction
weak supervision
truthfulness
incentive compatibility
LLM evaluation