🤖 AI Summary
Large language models (LLMs) exhibit severe hallucinations in scientific text generation, driven in particular by domain-specific terminology, statistical reasoning errors, and context-dependent factual distortions.
Method: We introduce CAP, the first multilingual scientific hallucination detection dataset, covering nine languages with 900 scientific questions and over 7,000 LLM-generated answers. CAP employs a novel fine-grained dual-labeling scheme (factual correctness plus textual fluency) validated through human annotation, and pairs each answer with token sequences and logits from 16 publicly available LLMs to support logits-based confidence scoring. A cross-lingual sampling strategy ensures balanced representation across high- and low-resource languages.
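The summary mentions logits-based confidence scoring without specifying a formula. A common choice for scoring a generated answer from its per-token logits is the mean token log-probability; the sketch below assumes raw logit vectors per generation step (function and variable names are illustrative, not the dataset's API).

```python
import math

def sequence_confidence(step_logits, emitted_ids):
    """Mean log-probability of the emitted tokens.

    step_logits: one logit vector (list of floats over the vocabulary)
                 per generated token.
    emitted_ids: the vocabulary index of the token actually emitted
                 at each step.
    """
    log_probs = []
    for logits, tok in zip(step_logits, emitted_ids):
        # Stable log-softmax: log p(tok) = logits[tok] - logsumexp(logits)
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        log_probs.append(logits[tok] - log_z)
    # Lower values suggest the model was less confident in its answer,
    # a signal often correlated with hallucination.
    return sum(log_probs) / len(log_probs)
```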
Contribution/Results: CAP enables the first systematic annotation and analysis of scientific hallucinations across languages, uncovering consistent model biases in expert domains. It releases instance-level factuality and fluency annotations alongside token sequences and model logits, advancing scientific hallucination detection, multilingual trustworthiness evaluation, and the development of robust NLP systems.
📝 Abstract
We introduce the CAP (Confabulations from ACL Publications) dataset, a multilingual resource for studying hallucinations in large language models (LLMs) within scientific text generation. CAP focuses on the scientific domain, where hallucinations frequently distort factual knowledge. In this domain, specialized terminology, statistical reasoning, and context-dependent interpretations further exacerbate these distortions, particularly given LLMs' lack of true comprehension, limited contextual understanding, and bias toward surface-level generalization. CAP operates in a cross-lingual setting covering five high-resource languages (English, French, Hindi, Italian, and Spanish) and four low-resource languages (Bengali, Gujarati, Malayalam, and Telugu). The dataset comprises 900 curated scientific questions and over 7,000 LLM-generated answers from 16 publicly available models, provided as question-answer pairs along with token sequences and corresponding logits. Each instance is annotated with a binary label indicating the presence of a scientific hallucination (a factuality error) and a fluency label capturing issues in the linguistic quality or naturalness of the text. CAP is publicly released to facilitate research on hallucination detection, multilingual evaluation of LLMs, and the development of more reliable scientific NLP systems.
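As a concrete picture of what an instance described above might contain, here is a hypothetical record combining the fields the abstract lists; all field names are illustrative assumptions, not the released schema.

```python
# Hypothetical CAP-style record; field names are illustrative, not the
# dataset's actual schema.
instance = {
    "language": "bn",            # one of the nine covered languages
    "model": "<model-name>",     # one of the 16 publicly available LLMs
    "question": "<curated scientific question>",
    "answer": "<LLM-generated answer>",
    "tokens": ["<tok>", ...],    # token sequence of the answer
    "logits": [[0.0, ...], ...], # per-token logits from the generating model
    "hallucination": 1,          # binary factuality label (1 = hallucination)
    "fluency_issue": 0,          # binary fluency label (1 = disfluent text)
}
```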