🤖 AI Summary
Large language models (LLMs) frequently generate factually erroneous scientific outputs ("confabulations"), yet existing binary evaluation methods lack the granularity needed for precise diagnostic analysis. Method: We introduce ReFACT (Reddit False And Correct Texts), an expert-annotated benchmark of 1,001 scientific question-answer pairs featuring fine-grained confabulation annotations: exact error spans and an error taxonomy (e.g., conceptual confusion, numerical inaccuracy, causal inversion). We propose a three-stage evaluation protocol — detection, localization, and correction — validated via multi-level expert review and domain-specific classification. Contribution/Results: Evaluation across nine state-of-the-art LLMs reveals an average detection accuracy of only ~50%; even top-tier models such as GPT-4o fail to reliably distinguish factual from confabulated responses. This exposes fundamental limitations in both current LLMs and the LLM-as-judge paradigm for factual assessment. ReFACT establishes an interpretable, attribution-aware benchmark for scientifically grounded truthfulness evaluation.
📝 Abstract
Large Language Models (LLMs) frequently confabulate scientific facts, severely undermining their trustworthiness. Addressing this challenge requires benchmarks that go beyond binary factuality and enable fine-grained evaluation. We introduce **ReFACT** (*Reddit False And Correct Texts*), a benchmark of 1,001 expert-annotated question-answer pairs spanning diverse scientific domains for the detection of scientific confabulation. Each instance includes both a scientifically correct answer and a non-factual counterpart annotated with **precise error spans and error types**. ReFACT enables multi-stage evaluation: (1) confabulation detection, (2) fine-grained error localization, and (3) correction. We benchmark 9 state-of-the-art LLMs, revealing limited performance (~50% accuracy). Even top models such as GPT-4o fail to distinguish factual from confabulated scientific answers, raising concerns about the reliability of *LLM-as-judge* evaluation paradigms. Our findings highlight the need for fine-grained, human-validated benchmarks to detect and correct scientific confabulation in domain-specific contexts. The dataset is released on GitHub: https://github.com/ddz5431/ReFACT.
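To make the benchmark's structure concrete, here is a minimal sketch of what a paired instance and the stage-1 detection metric could look like. All field and function names (`ErrorSpan`, `Instance`, `detection_accuracy`) are illustrative assumptions, not the dataset's actual schema; a judge that flags everything (or nothing) as confabulated scores exactly 50%, which is the chance-level baseline the paper's ~50% result should be read against.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical schema for a ReFACT-style instance (names are assumptions,
# not the released dataset's actual format).
@dataclass
class ErrorSpan:
    start: int       # character offset of the error in the confabulated answer
    end: int
    error_type: str  # e.g. "conceptual confusion", "numerical inaccuracy"

@dataclass
class Instance:
    question: str
    correct_answer: str
    confabulated_answer: str
    spans: list[ErrorSpan]  # fine-grained annotations for stages 2 and 3

def detection_accuracy(instances: list[Instance],
                       judge: Callable[[str, str], bool]) -> float:
    """Stage 1 (detection): judge(question, answer) returns True iff it
    flags the answer as confabulated. Each instance contributes two
    judgments, one per answer variant."""
    correct, total = 0, 0
    for inst in instances:
        correct += judge(inst.question, inst.confabulated_answer) is True
        correct += judge(inst.question, inst.correct_answer) is False
        total += 2
    return correct / total
```

Because every question appears with both a factual and a confabulated answer, a judge biased toward either verdict cannot exceed chance, which makes the metric robust to always-yes/always-no strategies.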