MedFact: A Large-scale Chinese Dataset for Evidence-based Medical Fact-checking of LLM Responses

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing medical fact-checking datasets largely overlook content generated by large language models (LLMs), particularly lacking high-quality, evidence-based Chinese medical verification resources. Method: We introduce MedFact—the first evidence-based Chinese medical fact-checking dataset—comprising 1,321 real-world clinical questions and 7,409 claims. We propose a systematic LLM-oriented data construction framework integrating clinical expert review and iterative human annotation to ensure rigor and reliability. Contribution/Results: Extensive in-context learning and fine-tuning experiments across diverse mainstream LLMs reveal critical, previously unreported deficiencies in Chinese medical fact-checking performance. MedFact is publicly released, establishing the first benchmark and reproducible evaluation standard for this domain, thereby advancing the development of safe, trustworthy medical AI systems.

📝 Abstract
Medical fact-checking has become increasingly critical as more individuals seek medical information online. However, existing datasets predominantly focus on human-generated content, leaving the verification of content generated by large language models (LLMs) relatively unexplored. To address this gap, we introduce MedFact, the first evidence-based Chinese medical fact-checking dataset of LLM-generated medical content. It consists of 1,321 questions and 7,409 claims, mirroring the complexities of real-world medical scenarios. We conduct comprehensive experiments in both in-context learning (ICL) and fine-tuning settings, showcasing the capability and challenges of current LLMs on this task, accompanied by an in-depth error analysis to point out key directions for future research. Our dataset is publicly available at https://github.com/AshleyChenNLP/MedFact.
Problem

Research questions and friction points this paper is trying to address.

Verifying medical content generated by large language models
Addressing the gap in Chinese evidence-based medical fact-checking datasets
Evaluating LLM performance on complex real-world medical scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

First evidence-based Chinese dataset for fact-checking LLM-generated medical content
Contains 1,321 questions and 7,409 medical claims
Evaluates both in-context learning and fine-tuning approaches
👥 Authors
Tong Chen
School of AI and Advanced Computing, Xi’an Jiaotong-Liverpool University, China
Zimu Wang
Tsinghua University
Recommendation
Yiyi Miao
School of AI and Advanced Computing, Xi’an Jiaotong-Liverpool University, China
Haoran Luo
Nanyang Technological University
Knowledge Graph, Large Language Models, Graph Neural Networks
Yuanfei Sun
School of AI and Advanced Computing, Xi’an Jiaotong-Liverpool University, China
Wei Wang
School of Advanced Technology, Xi’an Jiaotong-Liverpool University, China
Zhengyong Jiang
Xi’an Jiaotong-Liverpool University
Deep Learning, Reinforcement Learning
Procheta Sen
Lecturer / Assistant Professor, University of Liverpool
Natural Language Processing, Information Retrieval, Socially Responsible AI
Jionglong Su
Xi'an Jiaotong-Liverpool University
AI, Big Data, Machine Learning, Statistics