🤖 AI Summary
Large language models (LLMs) frequently generate factually erroneous scientific outputs ("confabulations"), yet existing binary evaluation methods lack the granularity needed for precise diagnostic analysis. Method: We introduce ReFACT (Reddit False And Correct Texts), an expert-annotated benchmark of 1,001 scientific question-answer pairs featuring fine-grained confabulation annotations: exact error spans and an error taxonomy (e.g., conceptual confusion, numerical inaccuracy, causal inversion). We propose a three-stage evaluation protocol — detection, localization, and correction — validated via multi-level expert review and domain-specific classification. Contribution/Results: Evaluation across nine state-of-the-art LLMs reveals an average detection accuracy of only ~50%; even top-tier models such as GPT-4o fail to reliably distinguish factual from confabulated responses. This exposes fundamental limitations in both current LLMs and the LLM-as-judge paradigm for factual assessment. ReFACT establishes an interpretable, attribution-aware benchmark for scientifically grounded truthfulness evaluation.
📝 Abstract
Large Language Models (LLMs) frequently confabulate scientific facts, severely undermining their trustworthiness. Addressing this challenge requires benchmarks that go beyond binary factuality and enable fine-grained evaluation. We introduce **ReFACT** (*Reddit False And Correct Texts*), a benchmark of 1,001 expert-annotated question-answer pairs spanning diverse scientific domains for the detection of scientific confabulation. Each instance includes both a scientifically correct answer and a non-factual counterpart annotated with **precise error spans and error types**. ReFACT enables multi-stage evaluation: (1) confabulation detection, (2) fine-grained error localization, and (3) correction. We benchmark 9 state-of-the-art LLMs, revealing limited performance (~50% accuracy). Even top models such as GPT-4o fail to distinguish factual from confabulated scientific answers, raising concerns about the reliability of *LLM-as-judge* evaluation paradigms. Our findings highlight the need for fine-grained, human-validated benchmarks to detect and correct scientific confabulation in domain-specific contexts. The dataset is released on GitHub: https://github.com/ddz5431/ReFACT.
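To make the benchmark's structure concrete, here is a minimal sketch of what a paired instance and the stage-1 detection metric could look like. All field and function names (`ErrorSpan`, `Instance`, `detection_accuracy`) are illustrative assumptions, not the dataset's actual schema; a judge that flags everything (or nothing) as confabulated scores exactly 50%, which is the chance-level baseline the paper's ~50% result should be read against.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical schema for a ReFACT-style instance (names are assumptions,
# not the released dataset's actual format).
@dataclass
class ErrorSpan:
    start: int       # character offset of the error in the confabulated answer
    end: int
    error_type: str  # e.g. "conceptual confusion", "numerical inaccuracy"

@dataclass
class Instance:
    question: str
    correct_answer: str
    confabulated_answer: str
    spans: list[ErrorSpan]  # fine-grained annotations for stages 2 and 3

def detection_accuracy(instances: list[Instance],
                       judge: Callable[[str, str], bool]) -> float:
    """Stage 1 (detection): judge(question, answer) returns True iff it
    flags the answer as confabulated. Each instance contributes two
    judgments, one per answer variant."""
    correct, total = 0, 0
    for inst in instances:
        correct += judge(inst.question, inst.confabulated_answer) is True
        correct += judge(inst.question, inst.correct_answer) is False
        total += 2
    return correct / total
```

Because every question appears with both a factual and a confabulated answer, a judge biased toward either verdict cannot exceed chance, which makes the metric robust to always-yes/always-no strategies.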