AI Summary
Scientific citation generation faces two challenges: citation ambiguity and high hallucination rates in large language models (LLMs), both of which undermine reliability in research applications. To address this, the authors introduce REASONS, a fine-grained evaluation benchmark with sentence-level attribution annotations across 12 scientific disciplines. They propose a dual-scenario evaluation framework: *indirect querying* (sentence → paper title) and *direct querying* (sentence → author attribution). Both scenarios are augmented with contextual metadata to suppress hallucinations, and retrieval-augmented generation (RAG) is evaluated with the Mistral model. RAG with Mistral reduces hallucination rates by 42% in indirect querying while keeping precision competitive with larger models such as GPT-4o. Adversarial testing shows that current LLMs struggle to link paper titles to their abstracts, which points to a fundamental limitation. REASONS establishes the first discipline-diverse, sentence-level citation evaluation standard and delivers a reproducible pipeline that mitigates hallucination for trustworthy scientific AI.
Abstract
Large language models (LLMs) present a promising yet challenging frontier for automated source citation in scientific communication. Previous approaches to citation generation have been limited by citation ambiguity and LLM overgeneralization. We introduce REASONS, a novel dataset with sentence-level annotations across 12 scientific domains from arXiv. Our evaluation framework covers two key citation scenarios: indirect queries (matching sentences to paper titles) and direct queries (author attribution), both enhanced with contextual metadata. We conduct extensive experiments with models such as GPT-O1, GPT-4O, GPT-3.5, DeepSeek, and other smaller models like Perplexity AI (7B). While top-tier LLMs achieve high performance in sentence attribution, they struggle with high hallucination rates, a key metric for scientific reliability. Our metadata-augmented approach reduces hallucination rates across all tasks, offering a promising direction for improvement. Retrieval-augmented generation (RAG) with Mistral improves performance in indirect queries, reducing hallucination rates by 42% and maintaining competitive precision with larger models. However, adversarial testing highlights challenges in linking paper titles to abstracts, revealing fundamental limitations in current LLMs. REASONS provides a challenging benchmark for developing reliable and trustworthy LLMs in scientific applications.
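The two query scenarios and the hallucination metric can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the prompt wording, the `Example` fields, and the scoring rule are not specified in the abstract. In particular, the sketch treats a predicted title as hallucinated when it is wrong and also matches no title in a known corpus, which may differ from the benchmark's actual definition.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """Hypothetical record for one annotated sentence (field names are assumptions)."""
    sentence: str
    title: str
    metadata: str = ""  # assumed contextual metadata, e.g. a domain label


def indirect_prompt(ex: Example, use_metadata: bool = True) -> str:
    """Indirect query: ask for the paper title that a sentence should cite."""
    context = f"Context: {ex.metadata}\n" if use_metadata and ex.metadata else ""
    return f"{context}Sentence: {ex.sentence}\nWhich paper (title) does this sentence cite?"


def direct_prompt(ex: Example, use_metadata: bool = True) -> str:
    """Direct query: ask for author attribution."""
    context = f"Context: {ex.metadata}\n" if use_metadata and ex.metadata else ""
    return f"{context}Sentence: {ex.sentence}\nWho are the authors of the cited work?"


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences match."""
    return " ".join(text.lower().split())


def score_titles(predictions, gold_titles, known_titles):
    """Return precision and an assumed hallucination rate for predicted titles.

    A prediction counts as correct if it matches the gold title, and as
    hallucinated if it is incorrect and absent from `known_titles`.
    """
    if not predictions or len(predictions) != len(gold_titles):
        raise ValueError("predictions and gold_titles must be non-empty and equal length")
    known = {normalize(t) for t in known_titles}
    correct = hallucinated = 0
    for pred, gold in zip(predictions, gold_titles):
        p = normalize(pred)
        if p == normalize(gold):
            correct += 1
        elif p not in known:
            hallucinated += 1
    n = len(predictions)
    return {"precision": correct / n, "hallucination_rate": hallucinated / n}
```

With toy titles, `score_titles(["Paper A", "Made Up Paper", "Paper B"], ["paper a", "Paper B", "Paper C"], ["Paper A", "Paper B", "Paper C"])` scores one correct prediction, one hallucinated title, and one wrong but real title. Separating the last two cases is the point of tracking hallucination alongside precision.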