Debate Helps Weak Judges Reward Stronger Models

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes a lightweight, training-free debate protocol with three roles (answerer, critic, and judge) for settings where a weak judge cannot reliably supervise a stronger model on programmatically verifiable code and logic tasks. Debate improves on a consultancy baseline only when two conditions hold: the critic's classification ability exceeds the judge's, and the judge treats the critic's statements as claims to verify rather than as testimony to summarize. A pre-deployment audit that checks both conditions predicts whether debate will help. On the three of five model pairings that satisfy the conditions, debate yields statistically significant gains over consultancy; on the other two it has no measurable effect. Ablating rebuttal rounds produces no measurable change in judge performance, indicating that a single independent critique captures most of the benefit at lower inference cost.

📝 Abstract

Despite theoretical promise, debate as a scalable oversight protocol has produced mixed empirical results: gains in some settings, and null effects in others, especially when the judge does not have information hidden from it. We study proposer-critic debate in a stronger-debater/weaker-judge setting on programmatically verifiable code and logic tasks. Debate helps the judge over a consultancy baseline when the critic provides a usable advantage: the critic's classification ability must exceed the judge's, and the judge must treat critic speeches as claims to verify rather than testimony to summarize. On the three of five pairings where the condition holds, proposer-critic debate's gains are statistically significant over consultancy, and these pairings are the most capable model pairings. On the two non-responder pairings in our set, debate produces null effects, and judge verification rates drop by tens of percentage points once a critic enters the transcript. In these cases the critic's binary-classification ability and the judge's are within noise of each other, and the critic's disagreement is parsed as testimony rather than a claim to check. Ablating rebuttal rounds from debate produces no measurable change in judge performance: a single independent critique recovers the bulk of debate's benefit at lower inference cost. These findings suggest a cheaper primitive for training-free scalable oversight in verifiable domains (answer, critique, judge) and a pre-deployment audit (does the critic beat the judge, and will the judge verify it?) that predicts when debate will help.
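
The pre-deployment audit described in the abstract asks two questions: does the critic beat the judge at binary classification, and does the judge actually verify critic claims? The Python sketch below is one possible way to operationalize it, not the authors' procedure. The function names, the normal-approximation margin used for "within noise", and the 0.5 verification-rate threshold are all illustrative assumptions that do not come from the paper.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class AuditResult:
    critic_acc: float
    judge_acc: float
    gap: float                # critic accuracy minus judge accuracy
    margin: float             # noise margin for the gap (assumed method)
    verification_rate: float  # share of transcripts where the judge checked a claim
    debate_expected_to_help: bool


def accuracy(preds, labels):
    """Fraction of binary predictions that match the ground-truth labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def audit(critic_preds, judge_preds, labels, judge_verified,
          min_verification_rate=0.5, z=1.96):
    """Hypothetical audit: critic must beat judge beyond noise, and the
    judge must verify critic claims often enough.

    min_verification_rate and z are illustrative defaults, not values
    taken from the paper.
    """
    n = len(labels)
    critic_acc = accuracy(critic_preds, labels)
    judge_acc = accuracy(judge_preds, labels)
    gap = critic_acc - judge_acc

    # Normal-approximation standard error for a difference of two
    # proportions. It ignores that both models see the same items,
    # which is a simplification.
    se = sqrt(critic_acc * (1 - critic_acc) / n
              + judge_acc * (1 - judge_acc) / n)
    margin = z * se

    verification_rate = sum(judge_verified) / len(judge_verified)
    helps = gap > margin and verification_rate >= min_verification_rate
    return AuditResult(critic_acc, judge_acc, gap, margin,
                       verification_rate, helps)
```

To use it, one would collect critic and judge verdicts on a labeled held-out set, plus a per-transcript flag recording whether the judge checked the critic's claim (for example, by running the code), and call `audit(...)` before deciding whether to deploy debate for that model pairing.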

Problem

Research questions and friction points this paper is trying to address.

debate

scalable oversight

weak judge

strong model

verifiable tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

scalable oversight

proposer-critic debate

weak judge