AI Summary
Existing Chinese medical fact-checking benchmarks suffer from narrow coverage and weak factual grounding, failing to reflect clinical complexity. To address this, we propose MedFact, the first fine-grained Chinese-language medical fact-checking benchmark, covering 13 specialties, 8 error types, and diverse textual styles. It employs a hybrid annotation paradigm combining AI-assisted pre-screening with multi-round expert collaboration to ensure high data quality and difficulty. Methodologically, we introduce a multi-criteria filtering strategy, iterative integration of expert feedback, and a multi-agent reasoning-based evaluation framework. Experimental evaluation of 20 state-of-the-art LLMs reveals that while models exhibit basic error-detection capability, they fall significantly short in precise error localization (F1 = 0.42 vs. expert 0.91) and plausibility assessment, frequently exhibiting "over-criticism." MedFact establishes a standardized, clinically grounded benchmark for the rigorous evaluation of medical LLMs.
Abstract
The increasing deployment of Large Language Models (LLMs) in healthcare necessitates a rigorous evaluation of their factual reliability. However, existing benchmarks are often limited to narrow data domains and fail to capture the complexity of real-world medical information. To address this critical gap, we introduce MedFact, a new and challenging benchmark for Chinese medical fact-checking. MedFact comprises 2,116 expert-annotated instances curated from diverse real-world texts, spanning 13 medical specialties, 8 fine-grained error types, 4 writing styles, and multiple difficulty levels. Its construction employs a hybrid AI-human framework in which iterative expert feedback refines an AI-driven, multi-criteria filtering process, ensuring both high data quality and difficulty. We conduct a comprehensive evaluation of 20 leading LLMs, benchmarking their performance on veracity classification and error localization against a human expert baseline. Our results reveal that while models can often determine whether a text contains an error, precisely localizing it remains a substantial challenge, with even top-performing models falling short of human performance. Furthermore, our analysis uncovers a frequent "over-criticism" phenomenon, a tendency for models to misidentify correct information as erroneous, which is exacerbated by advanced reasoning techniques such as multi-agent collaboration and inference-time scaling. By highlighting these critical challenges for deploying LLMs in medical applications, MedFact provides a robust resource to drive the development of more factually reliable and medically aware models.