AI Summary
Existing Chinese medical fact-checking benchmarks suffer from narrow coverage and weak factual grounding, failing to reflect clinical complexity. To address this, we propose MedFact, the first fine-grained Chinese-language medical fact-checking benchmark, covering 13 specialties, 8 error types, and diverse textual styles. It employs a hybrid annotation paradigm combining AI-assisted pre-screening with multi-round expert collaboration to ensure high data quality and difficulty. Methodologically, we introduce a multi-criteria filtering strategy, iterative integration of expert feedback, and a multi-agent reasoning-based evaluation framework. Experimental evaluation of 20 state-of-the-art LLMs reveals that while models exhibit basic error-detection capability, they fall significantly short in precise error localization (F1 = 0.42 vs. expert 0.91) and plausibility assessment, frequently exhibiting "over-criticism." MedFact establishes a standardized, clinically grounded benchmark for the rigorous evaluation of medical LLMs.
Abstract
The increasing deployment of Large Language Models (LLMs) in healthcare necessitates a rigorous evaluation of their factual reliability. However, existing benchmarks are often limited to narrow data domains and fail to capture the complexity of real-world medical information. To address this critical gap, we introduce MedFact, a new and challenging benchmark for Chinese medical fact-checking. MedFact comprises 2,116 expert-annotated instances curated from diverse real-world texts, spanning 13 medical specialties, 8 fine-grained error types, 4 writing styles, and multiple difficulty levels. Its construction employs a hybrid AI-human framework in which iterative expert feedback refines an AI-driven, multi-criteria filtering process, ensuring both high data quality and difficulty. We conduct a comprehensive evaluation of 20 leading LLMs, benchmarking their performance on veracity classification and error localization against a human expert baseline. Our results reveal that while models can often determine whether a text contains an error, precisely localizing it remains a substantial challenge, with even top-performing models falling short of human performance. Furthermore, our analysis uncovers a frequent "over-criticism" phenomenon, a tendency for models to misidentify correct information as erroneous, which is exacerbated by advanced reasoning techniques such as multi-agent collaboration and inference-time scaling. By highlighting these critical challenges for deploying LLMs in medical applications, MedFact provides a robust resource to drive the development of more factually reliable and medically aware models.