FindTheFlaws: Annotated Errors for Detecting Flawed Reasoning and Scalable Oversight Research

📅 2025-03-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
AI supervision scaling is hindered by challenges in reliable verification and scarcity of high-quality, fine-grained evaluation data. Method: We introduce the first long-context benchmark dataset spanning five domains—medicine, mathematics, science, programming, and Lojban—each containing expert-verified correct solutions and defective solutions annotated with structured, fine-grained error types. Contribution/Results: We propose a novel data paradigm integrating multi-domain coverage, long-context reasoning, and taxonomy-driven error annotation. Our systematic analysis reveals pronounced task dependency and cross-domain gradients in large language models’ critical evaluation capabilities; notably, human experts still substantially outperform state-of-the-art models on several tasks, establishing a robust gold-standard verifier benchmark. The dataset supports training and evaluation of debate-based, critique-driven, and proof-verifier supervision protocols. It is publicly released to advance scalable, verifiable AI supervision research.

📝 Abstract
As AI models tackle increasingly complex problems, ensuring reliable human oversight becomes more challenging due to the difficulty of verifying solutions. Approaches to scaling AI supervision include debate, in which two agents engage in structured dialogue to help a judge evaluate claims; critique, in which models identify potential flaws in proposed solutions; and prover-verifier games, in which a capable 'prover' model generates solutions that must be verifiable by a less capable 'verifier'. Evaluations of the scalability of these and similar approaches to difficult problems benefit from datasets that include (1) long-form expert-verified correct solutions and (2) long-form flawed solutions with annotations highlighting specific errors, but few are available. To address this gap, we present FindTheFlaws, a group of five diverse datasets spanning medicine, mathematics, science, coding, and the Lojban language. Each dataset contains questions and long-form solutions with expert annotations validating their correctness or identifying specific error(s) in the reasoning. We evaluate frontier models' critiquing capabilities and observe a range of performance that can be leveraged for scalable oversight experiments: models performing more poorly on particular datasets can serve as judges/verifiers for more capable models. Additionally, for some task/dataset combinations, expert baselines exceed even top model performance, making them more beneficial for scalable oversight experiments.
Problem

Research questions and friction points this paper is trying to address.

Lack of datasets with expert-verified correct and flawed solutions
Need for scalable oversight methods in complex AI problem-solving
Evaluating AI models' critiquing capabilities for reliable human oversight
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diverse datasets with expert annotations
Models critique flawed reasoning solutions
Prover-verifier games for scalable oversight
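The critique-evaluation setup described above can be sketched in miniature: a dataset pairs each question with an expert-verified correct solution or an annotated flawed one, and a model's critique is scored by whether it flags the flawed solutions and passes the correct ones. This is an illustrative sketch only; the class and function names below are hypothetical, not the paper's actual code or data schema.

```python
# Illustrative sketch of scoring a model's critiques against an
# annotated-error dataset in the style of FindTheFlaws.
# All names (AnnotatedSolution, score_critique) are hypothetical.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AnnotatedSolution:
    question: str
    solution: str
    is_flawed: bool
    error_note: Optional[str] = None  # expert's description of the error, if any

def score_critique(example: AnnotatedSolution, critique_found_flaw: bool) -> bool:
    """A critique is correct when it flags flawed solutions and passes sound ones."""
    return critique_found_flaw == example.is_flawed

examples = [
    AnnotatedSolution("Q1", "correct long-form proof ...", False),
    AnnotatedSolution("Q2", "proof with an invalid step ...", True,
                      error_note="step 3 divides by zero"),
]

# Suppose a model critiques both solutions and flags only the second:
model_verdicts = [False, True]
accuracy = sum(
    score_critique(ex, v) for ex, v in zip(examples, model_verdicts)
) / len(examples)
print(accuracy)  # 1.0
```

Under this framing, a weaker model whose accuracy falls well below an expert baseline can serve as the judge/verifier in scalable oversight experiments, exactly the gradient the paper reports across its five domains.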
👥 Authors

Gabriel Recchia (Modulo Research Ltd)
Chatrik Singh Mangat (Vector Research)
Issac Li (Princeton University)
Gayatri Krishnakumar (Impact Academy)