CiteCheck: Towards Accurate Citation Faithfulness Detection

📅 2025-02-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high cost of manually annotating negative samples for citation faithfulness detection in Chinese Retrieval-Augmented Generation (RAG) systems. The authors propose a low-cost, two-stage human annotation paradigm and introduce CiteCheck, described as the first large-scale, high-quality, class-balanced Chinese benchmark dataset for citation faithfulness evaluation. The construction pipeline combines LLM-assisted negative sampling with human-in-the-loop verification, which substantially reduces annotation effort. Experiments show that state-of-the-art (SOTA) large language models still achieve limited accuracy on the test set, whereas smaller LLMs, trained on data augmented with LLM-generated negative samples and fine-tuned with parameter-efficient fine-tuning (PEFT), attain competitive performance. The CiteCheck dataset is publicly released as a resource for trustworthy RAG research in Chinese.

📝 Abstract
Citation faithfulness detection is critical for enhancing retrieval-augmented generation (RAG) systems, yet large-scale Chinese datasets for this task are scarce. Existing methods face prohibitive costs because they require manually annotated negative samples. To address this, we introduce CiteCheck, the first large-scale Chinese dataset for citation faithfulness detection, constructed with a cost-effective two-stage manual annotation approach. This approach balances positive and negative samples while significantly reducing annotation expenses. CiteCheck comprises training and test splits. Experiments demonstrate that: (1) the test samples are highly challenging, with even state-of-the-art LLMs failing to achieve high accuracy; and (2) training data augmented with LLM-generated negative samples enables smaller models to attain strong performance using parameter-efficient fine-tuning. CiteCheck provides a robust foundation for advancing citation faithfulness detection in Chinese RAG systems. The dataset is publicly available to facilitate research.
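As a rough illustration of the task framing, citation faithfulness detection can be treated as binary classification over pairs of a statement and its cited document. The sketch below is not the CiteCheck schema or the authors' evaluation code: the `CitationSample` fields, the `accuracy` and `positive_fraction` helpers, and the toy English samples are all hypothetical. It shows one reason class balance matters: a trivial predictor that always answers "faithful" scores only chance-level accuracy on a balanced set.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class CitationSample:
    """Hypothetical record layout; the real CiteCheck fields may differ."""
    statement: str   # generated sentence that carries a citation
    document: str    # text of the cited source
    faithful: bool   # True if the document supports the statement


def accuracy(samples: Iterable[CitationSample],
             predict: Callable[[str, str], bool]) -> float:
    """Fraction of samples whose predicted label matches the gold label."""
    items: List[CitationSample] = list(samples)
    if not items:
        raise ValueError("cannot compute accuracy on an empty sample set")
    correct = sum(predict(s.statement, s.document) == s.faithful for s in items)
    return correct / len(items)


def positive_fraction(samples: Iterable[CitationSample]) -> float:
    """Share of faithful (positive) samples; 0.5 means class-balanced."""
    items = list(samples)
    if not items:
        raise ValueError("cannot compute balance on an empty sample set")
    return sum(s.faithful for s in items) / len(items)


def always_faithful(statement: str, document: str) -> bool:
    """Trivial baseline that ignores its inputs."""
    return True


# Toy, made-up samples: two positives and two negatives.
toy_samples = [
    CitationSample("The bridge opened in 1990.", "The bridge opened in 1990.", True),
    CitationSample("The bridge is 2 km long.", "Its total length is 2 km.", True),
    CitationSample("The bridge opened in 1995.", "The bridge opened in 1990.", False),
    CitationSample("The bridge is 5 km long.", "Its total length is 2 km.", False),
]

balance = positive_fraction(toy_samples)
baseline_accuracy = accuracy(toy_samples, always_faithful)
```

In practice `predict` would wrap a prompted or fine-tuned model; the baseline here only serves to show what chance-level performance looks like on balanced data.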
Problem

Research questions and friction points this paper is trying to address.

Citation faithfulness detection in Chinese
Addressing scarcity of large-scale datasets
Reducing annotation costs effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale Chinese dataset
Cost-effective manual annotation
LLM-generated negative samples
Authors

Ziyao Xu
Peking University

Shaohang Wei
National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University

Zhuoheng Han
National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University

Jing Jin
National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University

Zhe Yang
National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University

Xiaoguang Li
Noah's Ark Lab, HUAWEI
Question Answering, Information Retrieval, Dialogue Systems

Haochen Tan
City University of Hong Kong
NLP, Deep Learning

Zhijiang Guo
HKUST (GZ) | HKUST
Natural Language Processing, Machine Learning, Large Language Models

Houfeng Wang
National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University