MedFact: A Large-scale Chinese Dataset for Evidence-based Medical Fact-checking of LLM Responses

📅 2025-09-22

🤖 AI Summary
Problem: Existing medical fact-checking datasets largely overlook content generated by large language models (LLMs), and high-quality, evidence-based Chinese resources for verifying such content are lacking. Method: We introduce MedFact, the first evidence-based Chinese medical fact-checking dataset of LLM-generated medical content, comprising 1,321 questions and 7,409 claims that mirror real-world medical scenarios. We propose a systematic LLM-oriented data construction framework that integrates clinical expert review and iterative human annotation to ensure rigor and reliability. Contribution/Results: Experiments with in-context learning and fine-tuning across a range of mainstream LLMs show both the capability and the remaining challenges of current models on Chinese medical fact-checking. MedFact is publicly released as a benchmark for this task, supporting reproducible evaluation and the development of safe, trustworthy medical AI systems.

📝 Abstract
Medical fact-checking has become increasingly critical as more individuals seek medical information online. However, existing datasets predominantly focus on human-generated content, leaving the verification of content generated by large language models (LLMs) relatively unexplored. To address this gap, we introduce MedFact, the first evidence-based Chinese medical fact-checking dataset of LLM-generated medical content. It consists of 1,321 questions and 7,409 claims, mirroring the complexities of real-world medical scenarios. We conduct comprehensive experiments in both in-context learning (ICL) and fine-tuning settings, showcasing the capability and challenges of current LLMs on this task, accompanied by an in-depth error analysis to point out key directions for future research. Our dataset is publicly available at https://github.com/AshleyChenNLP/MedFact.
Problem

Research questions and friction points this paper is trying to address.

Verifying medical content generated by large language models
Addressing the gap in Chinese evidence-based medical fact-checking datasets
Evaluating LLM performance on complex real-world medical scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

First evidence-based Chinese dataset for fact-checking LLM-generated medical content
Contains 1,321 questions and 7,409 medical claims
Tests both in-context learning and fine-tuning approaches
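
As a rough illustration of the in-context learning setting mentioned above, the following Python sketch assembles a few-shot prompt for verifying one claim against evidence. It is a minimal sketch under stated assumptions, not the authors' pipeline: the record fields, the label names in `LABELS`, the prompt wording, and the helper names are all hypothetical and are not taken from the MedFact release. The last helper only reproduces the arithmetic implied by the dataset size (7,409 claims over 1,321 questions).

```python
from dataclasses import dataclass
from typing import List

# Hypothetical label set: the actual MedFact labels are not stated in this summary.
LABELS = ("supported", "refuted", "not enough evidence")


@dataclass
class ClaimExample:
    """One claim to verify (hypothetical schema, not the released format)."""
    question: str
    claim: str
    evidence: str
    label: str = ""  # left empty for the claim the model should classify


def format_example(ex: ClaimExample) -> str:
    """Render one example; an empty label leaves a bare 'Label:' slot to complete."""
    lines = [
        f"Question: {ex.question}",
        f"Claim: {ex.claim}",
        f"Evidence: {ex.evidence}",
        f"Label: {ex.label}".rstrip(),
    ]
    return "\n".join(lines)


def build_icl_prompt(demos: List[ClaimExample], target: ClaimExample) -> str:
    """Concatenate an instruction, the labeled demonstrations, and the unlabeled target."""
    header = "Classify each claim as one of: " + ", ".join(LABELS) + "."
    blocks = [header] + [format_example(d) for d in demos] + [format_example(target)]
    return "\n\n".join(blocks)


def claims_per_question(n_claims: int, n_questions: int) -> float:
    """Average number of claims per question."""
    return n_claims / n_questions


if __name__ == "__main__":
    demo = ClaimExample(
        question="What causes scurvy?",
        claim="Scurvy is caused by vitamin C deficiency.",
        evidence="Scurvy results from a prolonged lack of vitamin C.",
        label="supported",
    )
    target = ClaimExample(
        question="(question text)",
        claim="(claim extracted from an LLM response)",
        evidence="(retrieved evidence passage)",
    )
    print(build_icl_prompt([demo], target))
    print(round(claims_per_question(7409, 1321), 2))
```

The prompt ends on an open `Label:` slot so that a model's completion can be read directly as its prediction. A fine-tuning setup could reuse the same rendering with the gold label filled in as the training target.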