🤖 AI Summary
This work addresses the high cost of human annotation of negative samples for citation faithfulness detection in Chinese Retrieval-Augmented Generation (RAG) systems. To overcome this, we propose a low-cost, two-stage human annotation paradigm and introduce CiteCheck—the first large-scale, high-quality, class-balanced Chinese benchmark dataset for citation faithfulness evaluation. Our methodology integrates LLM-assisted negative sampling, human-in-the-loop verification, and parameter-efficient fine-tuning (PEFT), substantially reducing annotation effort. Experiments reveal that state-of-the-art (SOTA) large language models still achieve only limited accuracy on this challenging benchmark, whereas smaller LLMs, fine-tuned with PEFT on LLM-generated training data, attain competitive performance. The CiteCheck dataset is publicly released, providing critical infrastructure for advancing trustworthy RAG research in Chinese.
📝 Abstract
Citation faithfulness detection is critical for enhancing retrieval-augmented generation (RAG) systems, yet large-scale Chinese datasets for this task are scarce. Existing methods face prohibitive costs because negative samples must be manually annotated. To address this, we introduce CiteCheck, the first large-scale Chinese dataset for citation faithfulness detection, constructed via a cost-effective two-stage manual annotation approach. This method balances positive and negative samples while significantly reducing annotation expenses. CiteCheck comprises training and test splits. Experiments demonstrate that: (1) the test samples are highly challenging, with even state-of-the-art LLMs failing to achieve high accuracy; and (2) training data augmented with LLM-generated negative samples enables smaller models to attain strong performance through parameter-efficient fine-tuning. CiteCheck provides a robust foundation for advancing citation faithfulness detection in Chinese RAG systems, and the dataset is publicly available to facilitate research.
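The abstract does not specify which PEFT method is used; a common choice is LoRA, which freezes the pretrained weight matrix and trains only a low-rank additive update. A minimal NumPy sketch of the idea (the layer dimensions and rank below are illustrative assumptions, not values from the paper):

```python
import numpy as np

# Hypothetical dimensions for one projection layer of a small LLM;
# rank is the LoRA bottleneck dimension (rank << d_in, d_out).
d_out, d_in, rank = 768, 768, 8

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight

# LoRA trains a low-rank update delta_W = B @ A instead of updating W.
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, rank))                   # trainable, zero init so delta_W starts at 0

def forward(x):
    # Adapted layer: the frozen path W @ x plus the low-rank path B @ (A @ x).
    return W @ x + B @ (A @ x)

full_params = W.size                 # what full fine-tuning would update
lora_params = A.size + B.size        # what LoRA actually trains
print(f"trainable params: {lora_params} vs full fine-tune: {full_params}")
print(f"reduction: {full_params / lora_params:.1f}x")
```

With these toy dimensions, LoRA trains 12,288 parameters instead of 589,824 (a 48x reduction), which is why PEFT makes fine-tuning smaller models on LLM-generated training data cheap.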