RelationalFactQA: A Benchmark for Evaluating Tabular Fact Retrieval from Large Language Models

📅 2025-05-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing benchmarks focus on short, factoid question answering, overlooking large language models’ (LLMs) ability to generate structured, multi-record tables from parametric knowledge. We identify relational fact retrieval as substantially more challenging than pointwise querying, with model performance highly sensitive to output dimensions (e.g., number of attributes or records), revealing novel failure modes. Method: We introduce the first benchmark specifically designed to evaluate LLMs’ capability to generate relational tables—mapping natural language questions to structured tabular answers, accompanied by corresponding SQL queries. Our decomposable evaluation framework jointly considers query complexity, output scale, and data characteristics; we generate high-quality ground-truth tables via SQL–natural language alignment and construct a multidimensional, controllable test suite with automated factual verification. Contribution/Results: Experiments show state-of-the-art models achieve ≤25% factual accuracy, with sharp degradation as output dimensions increase—exposing a fundamental bottleneck in structured knowledge synthesis.

📝 Abstract
Factuality in Large Language Models (LLMs) is a persistent challenge. Current benchmarks often assess short factual answers, overlooking the critical ability to generate structured, multi-record tabular outputs from parametric knowledge. We demonstrate that this relational fact retrieval is substantially more difficult than isolated point-wise queries, even when individual facts are known to the model, exposing distinct failure modes sensitive to output dimensionality (e.g., number of attributes or records). To systematically evaluate this under-explored capability, we introduce RelationalFactQA, a new benchmark featuring diverse natural language questions (paired with SQL) and gold-standard tabular answers, specifically designed to assess knowledge retrieval in a structured format. RelationalFactQA enables analysis across varying query complexities, output sizes, and data characteristics. Our experiments reveal that even state-of-the-art LLMs struggle significantly, not exceeding 25% factual accuracy in generating relational outputs, with performance notably degrading as output dimensionality increases. These findings underscore critical limitations in current LLMs' ability to synthesize structured factual knowledge and establish RelationalFactQA as a crucial resource for measuring future progress in LLM factuality.
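The benchmark's headline number is factual accuracy of a model-generated table against a gold-standard table. The paper's exact matching procedure is not reproduced on this page, so the following is only a minimal illustrative sketch of one plausible tuple-level metric (the function names, the normalization, and the multiset-matching choice are all assumptions, not the authors' implementation):

```python
from collections import Counter

def normalize(value):
    """Lowercase and strip whitespace so 'Rome ' matches 'rome'."""
    return str(value).strip().lower()

def factual_accuracy(predicted_rows, gold_rows):
    """Fraction of gold records reproduced in the model's table.

    Rows are compared as normalized value tuples; each gold row can
    be matched at most once (multiset intersection via Counter).
    """
    pred = Counter(tuple(normalize(v) for v in row) for row in predicted_rows)
    gold = Counter(tuple(normalize(v) for v in row) for row in gold_rows)
    matched = sum((pred & gold).values())
    return matched / len(gold_rows) if gold_rows else 0.0

# Example: the model recalls 2 of 3 gold records (case differences ignored)
gold = [("Italy", "Rome"), ("France", "Paris"), ("Spain", "Madrid")]
pred = [("Italy", "Rome"), ("France", "paris"), ("Germany", "Berlin")]
print(factual_accuracy(pred, gold))  # 0.6666666666666666
```

A metric like this makes the paper's reported sensitivity to output dimensionality concrete: as the number of gold records or attributes grows, a single wrong or missing cell invalidates a whole row, so accuracy can fall quickly.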
Problem

Research questions and friction points this paper is trying to address.

Evaluating structured tabular fact retrieval from LLMs
Assessing multi-record outputs from parametric knowledge
Measuring LLM performance on relational fact accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces RelationalFactQA benchmark for tabular retrieval
Evaluates LLMs on structured multi-record outputs
Measures performance across varying query complexities
Dario Satriani
University of Basilicata, Potenza, Italy
Enzo Veltri
University of Basilicata, Potenza, Italy
Donatello Santoro
University of Basilicata, Potenza, Italy
Paolo Papotti
Professor at EURECOM
Data Management · Information Quality · LLMs