🤖 AI Summary
This work addresses the problem of reliably assessing answer credibility at language model inference time, so that a system can report high-confidence answers and abstain on the rest (selective prediction). The authors propose prover-verifier deliberation (PVD), an inference-time protocol that applies ideas from interactive proof theory to selective prediction with large language models. Frozen language models are cast in two roles, a prover and a verifier, and engage in a multi-turn dialogue: the prover defends a candidate answer through checkable sub-claims, while the verifier issues targeted challenges and returns an Accept, Challenge, or Reject verdict. Each dialogue yields both an answer and a structured, interpretable confidence signal. On the GPQA Diamond benchmark, answers that are accepted without revision, the “Accept + No Change” (ANC) subset, show roughly 30 percentage points higher precision than the non-ANC complement. Experiments with GPT and Gemini pairings indicate that this signal can transfer across model families, although the size of the gap depends on verifier strictness and domain competence.
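The dialogue loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `run_dialogue`, `DialogueResult`, and `max_turns`, the callable signatures for the prover and verifier, and the choice to leave an unresolved dialogue at Challenge when the turn budget runs out are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class Verdict(Enum):
    ACCEPT = "accept"
    CHALLENGE = "challenge"
    REJECT = "reject"


@dataclass
class DialogueResult:
    initial_answer: str
    final_answer: str
    verdict: Verdict
    turns: int

    @property
    def is_anc(self) -> bool:
        # Accept + No Change: accepted, and the answer was never revised.
        return (self.verdict is Verdict.ACCEPT
                and self.final_answer == self.initial_answer)


# Hypothetical role interfaces; in practice each would wrap a frozen model.
# prover(question, transcript) -> (answer, argument)
Prover = Callable[[str, List[str]], Tuple[str, str]]
# verifier(question, answer, argument, transcript) -> (verdict, message)
Verifier = Callable[[str, str, str, List[str]], Tuple[Verdict, str]]


def run_dialogue(question: str, prover: Prover, verifier: Verifier,
                 max_turns: int = 3) -> DialogueResult:
    """Run one prover-verifier dialogue (assumes max_turns >= 1)."""
    transcript: List[str] = []
    initial_answer = None
    answer = ""
    verdict = Verdict.CHALLENGE
    turns = 0
    for turns in range(1, max_turns + 1):
        # The prover states (or revises) its answer and defends it.
        answer, argument = prover(question, transcript)
        if initial_answer is None:
            initial_answer = answer
        transcript.append(f"PROVER: {answer} | {argument}")
        # The verifier accepts, rejects, or issues a targeted challenge.
        verdict, message = verifier(question, answer, argument, transcript)
        transcript.append(f"VERIFIER: {verdict.value} | {message}")
        if verdict is not Verdict.CHALLENGE:
            break  # Accept and Reject both end the dialogue.
    # Assumption: an exhausted turn budget leaves the verdict at CHALLENGE,
    # so the question is not counted as ANC.
    return DialogueResult(initial_answer, answer, verdict, turns)
```

A system built on this sketch would report `final_answer` only when `is_anc` is true and abstain otherwise.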
📝 Abstract
Reliably knowing when a language model is correct is almost as important as being correct. We introduce prover-verifier deliberation (PVD), an inference-time protocol grounded in interactive proof theory, as a mechanism for selective prediction: the protocol produces both an answer and a structured confidence verdict, allowing a system to report high-confidence answers while abstaining on uncertain cases. In each dialogue, a prover defends a candidate answer through checkable sub-claims while a verifier issues targeted challenges and returns Accept, Challenge, or Reject. Because frozen language models are imperfect provers and verifiers operating over a noisy channel, formal soundness and completeness guarantees do not transfer; instead, we characterize the protocol empirically through its coverage-precision behavior. Our main experiment uses Claude Sonnet 4.6 as prover and Claude Haiku 4.5 as verifier on GPQA Diamond. Questions accepted with no answer revision, which we call Accept + No Change (ANC), are reported as the high-confidence subset; we evaluate this subset by its precision (high-confidence precision, HC-Prec) and its coverage. ANC separates reliable from unreliable answers, yielding a gap of roughly 30 percentage points in HC-Prec over the non-ANC complement. Robustness experiments with GPT and Gemini pairings show that high HC-Prec can transfer across model families, while verifier strictness and domain competence largely determine the size of the selection gap. On Humanity's Last Exam, weaker prover-verifier pairings can collapse or invert the ANC signal, illustrating a practical failure mode when the verifier operates outside its effective region. Comparisons with self-consistency, universal self-consistency, multi-agent debate, and Reflexion suggest that prover-verifier deliberation supplies a distinct argument-defensibility signal for selective prediction.
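To make the coverage-precision evaluation concrete, the sketch below computes coverage, precision on the ANC subset, accuracy on the non-ANC complement, and the gap between them in percentage points. The function name `selection_metrics`, the `(is_anc, is_correct)` record format, and the reading of coverage as the fraction of questions that fall into ANC are assumptions for illustration; the abstract does not spell out the exact definitions.

```python
from typing import Dict, Iterable, Optional, Tuple


def selection_metrics(
    records: Iterable[Tuple[bool, bool]],
) -> Dict[str, Optional[float]]:
    """Summarize selective prediction from (is_anc, is_correct) pairs.

    Returns None for any quantity whose denominator would be zero.
    """
    anc_total = anc_correct = rest_total = rest_correct = 0
    for is_anc, is_correct in records:
        if is_anc:
            anc_total += 1
            anc_correct += int(is_correct)
        else:
            rest_total += 1
            rest_correct += int(is_correct)

    total = anc_total + rest_total
    # Coverage: share of questions reported as high-confidence (ANC).
    coverage = anc_total / total if total else None
    # Precision on the high-confidence subset.
    hc_prec = anc_correct / anc_total if anc_total else None
    # Accuracy on the non-ANC complement.
    rest_acc = rest_correct / rest_total if rest_total else None
    # Selection gap, in percentage points.
    gap_pp = None
    if hc_prec is not None and rest_acc is not None:
        gap_pp = (hc_prec - rest_acc) * 100
    return {
        "coverage": coverage,
        "hc_prec": hc_prec,
        "non_anc_acc": rest_acc,
        "gap_pp": gap_pp,
    }


# Toy data for illustration only (not results from the paper):
# 4 ANC questions with 3 correct, 6 non-ANC questions with 3 correct.
toy = [(True, True)] * 3 + [(True, False)] + [(False, True)] * 3 + [(False, False)] * 3
metrics = selection_metrics(toy)
```

A gap near zero or below zero on real data would correspond to the collapsed or inverted ANC signal that the abstract reports for weaker pairings.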