Enhancing Health Fact-Checking with LLM-Generated Synthetic Data

📅 2025-08-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of human-annotated data in health-related fact-checking, this paper proposes a large language model (LLM)-based synthetic data generation method. First, health documents are summarized and decomposed into atomic facts; then, a sentence-fact entailment table is constructed to automatically generate text-claim pairs with ground-truth labels. The approach aims for high data quality, interpretability, and domain adaptability. The synthetic data is then combined with real annotated data to fine-tune a BERT-based classifier. Experiments on PubHealth and SciFact show absolute F1-score improvements of up to 0.019 and 0.049, respectively, helping to alleviate the training data scarcity problem in low-resource health fact-checking. This work offers a controllable, LLM-driven paradigm for synthetic data generation in specialized domains.

📝 Abstract
Fact-checking for health-related content is challenging due to the limited availability of annotated training data. In this study, we propose a synthetic data generation pipeline that leverages large language models (LLMs) to augment training data for health-related fact checking. In this pipeline, we summarize source documents, decompose the summaries into atomic facts, and use an LLM to construct sentence-fact entailment tables. From the entailment relations in the table, we further generate synthetic text-claim pairs with binary veracity labels. These synthetic data are then combined with the original data to fine-tune a BERT-based fact-checking model. Evaluation on two public datasets, PubHealth and SciFact, shows that our pipeline improved F1 scores by up to 0.019 and 0.049, respectively, compared to models trained only on the original data. These results highlight the effectiveness of LLM-driven synthetic data augmentation in enhancing the performance of health-related fact-checkers.
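The abstract's label-generation step can be illustrated with a minimal sketch. The function and label names below are assumptions for illustration, not the paper's actual implementation: a sentence-fact entailment table maps each source sentence to the atomic facts it does or does not entail, and each (sentence, fact) cell yields one text-claim pair with a binary veracity label.

```python
def generate_pairs(entailment_table):
    """Turn a sentence-fact entailment table into labeled text-claim pairs.

    entailment_table: dict mapping a source sentence to a dict of
    atomic facts -> bool (True if the sentence entails the fact).
    """
    pairs = []
    for sentence, facts in entailment_table.items():
        for fact, entailed in facts.items():
            # Entailed facts become supported claims; non-entailed
            # facts become refuted claims (binary veracity labels).
            label = "SUPPORTED" if entailed else "REFUTED"
            pairs.append({"text": sentence, "claim": fact, "label": label})
    return pairs

# Toy example (invented content, for illustration only)
table = {
    "Vitamin D supplementation reduced fracture risk in older adults.": {
        "Vitamin D lowers fracture risk in the elderly.": True,
        "Vitamin D cures osteoporosis.": False,
    },
}
pairs = generate_pairs(table)
```

Each row of the table thus contributes both positive and negative training examples, which is what lets the pipeline produce ground-truth labels without human annotation.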
Problem

Research questions and friction points this paper is trying to address.

Addressing limited annotated training data for health fact-checking
Generating synthetic data using large language models
Improving fact-checking model performance with data augmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-generated synthetic data augmentation
Sentence-fact entailment table construction
BERT-based model fine-tuning enhancement
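The augmentation step, combining synthetic with original annotated data before fine-tuning, might look like the following sketch. The function name, the `synthetic_ratio` cap, and the seeded shuffle are assumptions, not details from the paper.

```python
import random

def mix_training_data(original, synthetic, synthetic_ratio=1.0, seed=42):
    """Combine real annotated examples with LLM-generated synthetic ones.

    synthetic_ratio caps the number of synthetic examples at a multiple
    of the original set's size; the result is shuffled deterministically
    so training batches interleave both sources.
    """
    rng = random.Random(seed)
    cap = int(len(original) * synthetic_ratio)
    sampled = synthetic[:cap] if len(synthetic) > cap else list(synthetic)
    mixed = list(original) + sampled
    rng.shuffle(mixed)
    return mixed

# Toy example (invented content, for illustration only)
original = [{"id": i, "source": "real"} for i in range(4)]
synthetic = [{"id": i, "source": "synthetic"} for i in range(10)]
mixed = mix_training_data(original, synthetic, synthetic_ratio=1.0)
```

The mixed set would then be passed to a standard BERT fine-tuning loop for the binary fact-checking classifier.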
Jingze Zhang
Population Health Sciences, Weill Cornell Medicine, New York, 10022, US

Jiahe Qian
Population Health Sciences, Weill Cornell Medicine, New York, 10022, US

Yiliang Zhou
University of California, Irvine

Yifan Peng
Population Health Sciences, Weill Cornell Medicine, New York, 10022, US

Tags: NLP · AI in healthcare · LLM