ReFACT: A Benchmark for Scientific Confabulation Detection with Positional Error Annotations

📅 2025-09-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Scientific large language models (LLMs) frequently generate factually erroneous outputs ("confabulations"), yet existing binary evaluation methods lack the granularity needed for precise diagnostic analysis. Method: We introduce ReFACT (Reddit False And Correct Texts), an expert-annotated benchmark comprising 1,001 scientific question-answer pairs, featuring fine-grained confabulation annotations with exact error spans and an error taxonomy (e.g., conceptual confusion, numerical inaccuracy, causal inversion). We propose a three-stage evaluation protocol (detection, localization, and correction), rigorously validated via multi-level expert review and domain-specific classification. Contribution/Results: Evaluation across nine state-of-the-art LLMs reveals an average confabulation detection accuracy of only ~50%; even top-tier models such as GPT-4o fail to reliably distinguish factual from confabulated responses. This exposes fundamental limitations in both current scientific LLMs and the LLM-as-judge paradigm for factual assessment. ReFACT establishes a new, interpretable, attribution-aware benchmark for scientifically grounded truthfulness evaluation.
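The three-stage protocol summarized above can be made concrete with a small sketch. The instance schema below is a hypothetical illustration (field names are not taken from the released dataset), and Stage 1 detection is modeled as simple binary classification accuracy, which matches the ~50% figure reported for current models:

```python
# Hypothetical sketch of a ReFACT-style instance and Stage 1 (detection).
# Field names are assumptions for illustration only; consult the released
# dataset on GitHub for the actual schema.
from dataclasses import dataclass


@dataclass
class RefactInstance:
    question: str
    correct_answer: str
    confabulated_answer: str
    error_span: tuple[int, int]   # character offsets of the erroneous span
    error_type: str               # e.g. "numerical inaccuracy"


def detection_accuracy(predictions: list[bool], gold: list[bool]) -> float:
    """Stage 1: fraction of answers correctly flagged as confabulated."""
    assert len(predictions) == len(gold) and gold
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Toy example: a model that is right half the time scores 0.5, the level
# the paper reports on average for state-of-the-art LLMs.
gold = [True, False, True, False]
preds = [True, True, False, False]
print(detection_accuracy(preds, gold))  # 0.5
```

Stages 2 and 3 (localization and correction) would score the predicted `error_span` and the rewritten answer against the annotated gold values.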

📝 Abstract
Large Language Models (LLMs) frequently confabulate scientific facts, severely undermining their trustworthiness. Addressing this challenge requires benchmarks that go beyond binary factuality and enable fine-grained evaluation. We introduce ReFACT (Reddit False And Correct Texts), a benchmark of 1,001 expert-annotated question-answer pairs spanning diverse scientific domains for the detection of scientific confabulation. Each instance includes both a scientifically correct answer and a non-factual counterpart annotated with precise error spans and error types. ReFACT enables multi-stage evaluation: (1) confabulation detection, (2) fine-grained error localization, and (3) correction. We benchmark 9 state-of-the-art LLMs, revealing limited performance (~50% accuracy). Even top models such as GPT-4o fail to distinguish factual from confabulated scientific answers, raising concerns about the reliability of LLM-as-judge evaluation paradigms. Our findings highlight the need for fine-grained, human-validated benchmarks to detect and correct scientific confabulation in domain-specific contexts. The dataset is released on GitHub: https://github.com/ddz5431/ReFACT
Problem

Research questions and friction points this paper is trying to address.

Detecting scientific confabulation in LLMs with precise error annotations
Benchmarking LLM performance on fine-grained error localization and correction
Evaluating reliability of LLM-as-judge paradigms for scientific factuality
Innovation

Methods, ideas, or system contributions that make the work stand out.

ReFACT benchmark detects scientific confabulation with fine-grained annotations
It provides precise error spans and error-types for localization
Enables multi-stage evaluation including detection and correction
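The error-localization stage listed above requires a span-level score. The paper's exact metric is not stated on this page; a common choice for comparing a predicted error span against a gold annotation, sketched here under that assumption, is character-level F1 over span offsets:

```python
# One plausible scoring function for Stage 2 (error localization).
# Character-level F1 between predicted and gold error spans; the paper's
# actual metric may differ.
def span_f1(pred: tuple[int, int], gold: tuple[int, int]) -> float:
    """F1 over the character positions covered by each half-open span."""
    pred_chars = set(range(pred[0], pred[1]))
    gold_chars = set(range(gold[0], gold[1]))
    if not pred_chars or not gold_chars:
        return 0.0
    overlap = len(pred_chars & gold_chars)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(gold_chars)
    return 2 * precision * recall / (precision + recall)


print(span_f1((10, 20), (15, 25)))  # 0.5 (half the characters overlap)
```

A character-level (rather than exact-match) score gives partial credit when a model identifies roughly the right region of the answer, which suits the fine-grained annotations ReFACT provides.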
Yindong Wang
Hasso Plattner Institute (HPI), University of Potsdam, Germany
Martin Preiß
Hasso Plattner Institute (HPI), University of Potsdam, Germany
Margarita Bugueño
Hasso Plattner Institute (HPI), University of Potsdam, Germany
Jan Vincent Hoffbauer
Hasso Plattner Institute (HPI), University of Potsdam, Germany
Abdullatif Ghajar
Hasso Plattner Institute (HPI), University of Potsdam, Germany
Tolga Buz
Hasso Plattner Institute (HPI), University of Potsdam, Germany
Gerard de Melo
Professor at Hasso Plattner Institute / University of Potsdam
Artificial Intelligence · Natural Language Processing · Web Mining