Frame-Guided Synthetic Claim Generation for Automatic Fact-Checking Using High-Volume Tabular Data

πŸ“… 2026-01-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the lack of effective benchmarks for real-world, high-volume structured data in automated fact-checking, a field often limited to small, manually curated tables. The authors propose a semantic frame–guided method to synthesize 78,503 multilingual claims from 434 complex OECD tables, each averaging over 500,000 rows, thereby establishing the first large-scale tabular fact-checking benchmark. Their approach integrates programmatic data point selection, six types of semantic frames, and multilingual generation to ensure claim verifiability while avoiding memorization by large language models, thus compelling models to rely on retrieval and reasoning. Experimental results reveal that current models perform poorly on this benchmark, highlighting evidence retrieval as the critical bottleneck and offering a valuable resource and new direction for research in fact-checking over large-scale tabular data.

πŸ“ Abstract
Automated fact-checking benchmarks have largely ignored the challenge of verifying claims against real-world, high-volume structured data, instead focusing on small, curated tables. We introduce a new large-scale, multilingual dataset to address this critical gap. It contains 78,503 synthetic claims grounded in 434 complex OECD tables, which average over 500K rows each. We propose a novel, frame-guided methodology where algorithms programmatically select significant data points based on six semantic frames to generate realistic claims in English, Chinese, Spanish, and Hindi. Crucially, we demonstrate through knowledge-probing experiments that LLMs have not memorized these facts, forcing systems to perform genuine retrieval and reasoning rather than relying on parametric knowledge. We provide a baseline SQL-generation system and show that our benchmark is highly challenging. Our analysis identifies evidence retrieval as the primary bottleneck, with models struggling to find the correct data in massive tables. This dataset provides a critical new resource for advancing research on this unsolved, real-world problem.
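The frame-guided generation idea described in the abstract can be illustrated with a minimal sketch: programmatically select a significant data point from a table, then realize it as a natural-language claim through a semantic-frame template. The frame names, templates, and toy table below are placeholders of my own; the paper's six actual frames are not enumerated on this page.

```python
# Hypothetical mini-table standing in for a (much larger) OECD indicator table.
table = [
    {"country": "France",  "year": 2020, "unemployment_rate": 8.0},
    {"country": "Germany", "year": 2020, "unemployment_rate": 3.8},
    {"country": "Japan",   "year": 2020, "unemployment_rate": 2.8},
]

# Illustrative semantic-frame templates (placeholder names, not the paper's six frames).
FRAMES = {
    "superlative": "{entity} had the highest {measure} ({value}) among the countries reported in {year}.",
    "comparison":  "{a} had a higher {measure} than {b} in {year}.",
}

def superlative_claim(rows, measure):
    # Programmatic data-point selection: pick the row holding the maximum value,
    # a "significant" point in the sense of the superlative frame.
    row = max(rows, key=lambda r: r[measure])
    return FRAMES["superlative"].format(
        entity=row["country"],
        measure=measure.replace("_", " "),
        value=row[measure],
        year=row["year"],
    )

claim = superlative_claim(table, "unemployment_rate")
print(claim)
# → France had the highest unemployment rate (8.0) among the countries reported in 2020.
```

Because the selected value is read directly from the table, every generated claim is verifiable by construction; negative (refuted) claims could be derived by perturbing the selected cell before templating.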
Problem

Research questions and friction points this paper is trying to address.

fact-checking
high-volume tabular data
structured data verification
large-scale benchmark
evidence retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

frame-guided generation
synthetic claim generation
large-scale tabular data
multilingual fact-checking
evidence retrieval
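The baseline mentioned in the abstract verifies claims by generating SQL over the source table. A minimal sketch of that verification loop, using an in-memory SQLite table and a hand-written query in place of a model-generated one (table schema, values, and the claim are my own illustrative assumptions):

```python
import sqlite3

# Hypothetical toy table; the benchmark's real OECD tables average 500K+ rows,
# which is what makes evidence retrieval the bottleneck.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE indicator (country TEXT, year INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO indicator VALUES (?, ?, ?)",
    [("France", 2020, 8.0), ("Germany", 2020, 3.8), ("Japan", 2020, 2.8)],
)

# Claim under test: "Germany's value in 2020 was 3.8".
# In the baseline, an LLM would generate this query; here it is written by hand.
query = "SELECT value FROM indicator WHERE country = ? AND year = ?"
(retrieved,) = conn.execute(query, ("Germany", 2020)).fetchone()

# Verification compares the retrieved cell against the claimed value.
label = "SUPPORTED" if retrieved == 3.8 else "REFUTED"
print(label)
# → SUPPORTED
```

At benchmark scale the hard step is not executing the query but producing one that targets the right rows and columns of a half-million-row table, which is consistent with the paper's finding that retrieval, not reasoning, is where current models fail.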
πŸ”Ž Similar Papers
No similar papers found.
J. Devasier, The University of Texas at Arlington
A. Putta, The University of Texas at Arlington
Qing Wang, The University of Texas at Arlington
Alankrit Moses, The University of Texas at Arlington
Chengkai Li, Professor of Computer Science and Engineering, The University of Texas at Arlington
Big Data & Data Science · Computational Journalism · Data-Driven Fact-Checking · Natural Language Processing