🤖 AI Summary
Existing medical fact-checking datasets largely overlook content generated by large language models (LLMs), and high-quality, evidence-based Chinese medical verification resources are especially scarce. Method: We introduce MedFact, the first evidence-based Chinese medical fact-checking dataset, comprising 1,321 real-world clinical questions and 7,409 claims. We propose a systematic, LLM-oriented data construction framework that integrates clinical expert review with iterative human annotation to ensure rigor and reliability. Contribution/Results: Extensive in-context learning and fine-tuning experiments across diverse mainstream LLMs reveal critical, previously unreported deficiencies in Chinese medical fact-checking performance. MedFact is publicly released, establishing the first benchmark and reproducible evaluation standard for this domain and thereby advancing the development of safe, trustworthy medical AI systems.
📝 Abstract
Medical fact-checking has become increasingly critical as more individuals seek medical information online. However, existing datasets predominantly focus on human-generated content, leaving the verification of content generated by large language models (LLMs) relatively unexplored. To address this gap, we introduce MedFact, the first evidence-based Chinese medical fact-checking dataset of LLM-generated medical content. It consists of 1,321 questions and 7,409 claims, mirroring the complexity of real-world medical scenarios. We conduct comprehensive experiments in both in-context learning (ICL) and fine-tuning settings, demonstrating both the capabilities and the limitations of current LLMs on this task, and we accompany these results with an in-depth error analysis that identifies key directions for future research. Our dataset is publicly available at https://github.com/AshleyChenNLP/MedFact.