MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts

📅 2025-09-15
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing Chinese medical fact-checking benchmarks suffer from narrow coverage and weak factual grounding, and fail to reflect clinical complexity. To address this, we propose MedFact, the first fine-grained Chinese-language medical fact-checking benchmark, covering 13 specialties, 8 error types, and diverse textual styles. It employs a hybrid annotation paradigm that combines AI-assisted pre-screening with multi-round expert collaboration to ensure both high data quality and difficulty. Methodologically, we introduce a multi-criteria filtering strategy, iterative integration of expert feedback, and a multi-agent reasoning-based evaluation framework. An experimental evaluation of 20 state-of-the-art LLMs reveals that while models exhibit basic error-detection capability, they fall significantly short of experts in precise error localization (F1 = 0.42 vs. 0.91) and plausibility assessment, frequently exhibiting "over-criticism." MedFact establishes a standardized, clinically grounded benchmark for the rigorous evaluation of medical LLMs.
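The summary's headline numbers (localization F1 of 0.42 for models vs. 0.91 for experts) imply a span-level scoring scheme. Below is a minimal sketch of such a metric, assuming gold and predicted error spans are (start, end) character offsets scored by exact match; both the data layout and the matching criterion are assumptions of this sketch, not the paper's definition.

```python
# Span-level F1 for error localization: a sketch under the assumption that
# annotations are sets of (start, end) character offsets and a prediction
# counts only on exact match. Relaxed variants (token overlap, sentence-level
# hits) would change only how true positives are counted.

def span_f1(gold_spans, pred_spans):
    """Exact-match precision/recall/F1 over sets of (start, end) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)                       # spans the model got exactly right
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: the model finds one of two annotated errors and adds a spurious one.
print(span_f1([(10, 24), (51, 60)], [(10, 24), (80, 95)]))  # 0.5
```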

📝 Abstract
The increasing deployment of Large Language Models (LLMs) in healthcare necessitates a rigorous evaluation of their factual reliability. However, existing benchmarks are often limited by narrow domains of data, failing to capture the complexity of real-world medical information. To address this critical gap, we introduce MedFact, a new and challenging benchmark for Chinese medical fact-checking. MedFact comprises 2,116 expert-annotated instances curated from diverse real-world texts, spanning 13 medical specialties, 8 fine-grained error types, 4 writing styles, and multiple difficulty levels. Its construction employs a hybrid AI-human framework where iterative expert feedback refines an AI-driven, multi-criteria filtering process, ensuring both high data quality and difficulty. We conduct a comprehensive evaluation of 20 leading LLMs, benchmarking their performance on veracity classification and error localization against a human expert baseline. Our results reveal that while models can often determine if a text contains an error, precisely localizing it remains a substantial challenge, with even top-performing models falling short of human performance. Furthermore, our analysis uncovers a frequent "over-criticism" phenomenon, a tendency for models to misidentify correct information as erroneous, which is exacerbated by advanced reasoning techniques such as multi-agent collaboration and inference-time scaling. By highlighting these critical challenges for deploying LLMs in medical applications, MedFact provides a robust resource to drive the development of more factually reliable and medically aware models.
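As a concrete reading of the two evaluation tasks named in the abstract (veracity classification and error localization), here is a minimal harness sketch; `call_llm` is a hypothetical wrapper around the model under test, and the JSON answer contract is an assumption of this sketch rather than the paper's protocol.

```python
# Sketch of a per-instance evaluation call. Real MedFact instances are Chinese
# medical texts; the prompt wording and output format here are illustrative.
import json

PROMPT = (
    "You are a medical fact-checker. Decide whether the following Chinese "
    "medical text contains a factual error.\n"
    'Answer in JSON: {"has_error": true|false, "error_span": "<erroneous '
    'excerpt, or empty string>"}\n'
    "Text: "
)

def check_fact(text: str, call_llm) -> dict:
    raw = call_llm(PROMPT + text)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # An unparseable answer is scored as a miss on both tasks.
        return {"has_error": None, "error_span": ""}

# The "over-criticism" phenomenon surfaces here as has_error=True on instances
# whose gold label is correct, i.e. false positives on the veracity task.
```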
Problem

Research questions and friction points this paper is trying to address.

Evaluating factual reliability of LLMs in Chinese medical contexts
Addressing limitations of narrow-domain medical fact-checking benchmarks
Assessing model performance on error detection and localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid AI-human framework for benchmark construction (sketched in code below)
Multi-criteria filtering refined by iterative expert feedback
Expert-annotated instances across 13 specialties, 8 error types, and 4 writing styles
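As a concrete reading of the hybrid construction pipeline referenced above, the sketch below couples an AI multi-criteria pre-screen with iterative expert review; the criteria names, threshold, round count, and both callables (`ai_score`, `expert_review`) are illustrative assumptions, not the paper's implementation.

```python
# Sketch of AI-assisted filtering with expert feedback loops. Rejected
# candidates are revised and re-screened, and the experts' rejection reasons
# are fed back so the AI pre-screen can tighten in later rounds.
from collections import namedtuple

Verdict = namedtuple("Verdict", ["accepted", "reason"])  # returned by expert_review

CRITERIA = ("factual_grounding", "difficulty", "style_diversity")  # assumed criteria

def build_benchmark(candidates, ai_score, expert_review, rounds=3, threshold=0.7):
    """Keep candidates that pass both the AI multi-criteria pre-screen and
    multi-round expert review, feeding rejection reasons back each round."""
    accepted, feedback, pool = [], [], list(candidates)
    for _ in range(rounds):
        # AI pre-screen: every criterion must clear the (assumed) threshold.
        screened = [c for c in pool
                    if all(ai_score(c, crit, feedback) >= threshold
                           for crit in CRITERIA)]
        verdicts = [(c, expert_review(c)) for c in screened]
        accepted += [c for c, v in verdicts if v.accepted]
        feedback += [v.reason for _, v in verdicts if not v.accepted]
        pool = [c for c, v in verdicts if not v.accepted]  # revise and retry
    return accepted
```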
Authors
Jiayi He (Xunfei Healthcare Technology Co., Ltd.)
Yangmin Huang (Xunfei Healthcare Technology Co., Ltd.)
Qianyun Du (Xunfei Healthcare Technology Co., Ltd.)
Xiangying Zhou (Xunfei Healthcare Technology Co., Ltd.)
Zhiyang He (Massachusetts Institute of Technology, Quantum Information)
Jiaxue Hu (Xunfei Healthcare Technology Co., Ltd.)
Xiaodong Tao (Xunfei Healthcare Technology Co., Ltd.)
Lixian Lai (Xunfei Healthcare Technology Co., Ltd.)