The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

📅 2025-12-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of a unified, cross-scenario benchmark for evaluating factual accuracy in large language models (LLMs). The authors propose a comprehensive evaluation framework covering four distinct factual scenarios: multimodal question answering, closed-book factual recall, search-augmented response generation, and document-grounded generation. Methodologically, the work introduces a decoupled four-dimensional evaluation design that integrates an upgraded FACTS Grounding v2, a multimodal fact-verification mechanism, and a dual (public/private) leaderboard split. The framework combines LLM-as-judge automated assessment, search-API-backed retrieval validation, and citation-based provenance analysis to yield comparable, fine-grained, and interpretable measurements of factual consistency across scenarios. The benchmark is publicly released on Kaggle, supporting fair, dynamic, and reproducible factuality evaluation across diverse LLMs.

📝 Abstract
We introduce The FACTS Leaderboard, an online leaderboard suite and associated set of benchmarks that comprehensively evaluates the ability of language models to generate factually accurate text across diverse scenarios. The suite provides a holistic measure of factuality by aggregating the performance of models on four distinct sub-leaderboards: (1) FACTS Multimodal, which measures the factuality of responses to image-based questions; (2) FACTS Parametric, which assesses models' world knowledge by answering closed-book factoid questions from internal parameters; (3) FACTS Search, which evaluates factuality in information-seeking scenarios, where the model must use a search API; and (4) FACTS Grounding (v2), which evaluates whether long-form responses are grounded in provided documents, featuring significantly improved judge models. Each sub-leaderboard employs automated judge models to score model responses, and the final suite score is an average of the four components, designed to provide a robust and balanced assessment of a model's overall factuality. The FACTS Leaderboard Suite will be actively maintained, containing both public and private splits to allow for external participation while guarding its integrity. It can be found at https://www.kaggle.com/benchmarks/google/facts.
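The abstract states that the final suite score is an average of the four sub-leaderboard scores. A minimal sketch of that aggregation, assuming the sub-leaderboard names and the example score values (none of which come from the paper):

```python
# Illustrative sketch only: the suite score is described as the average of the
# four FACTS sub-leaderboard scores. Key names and example values are assumed.

def suite_score(sub_scores: dict[str, float]) -> float:
    """Average the four FACTS sub-leaderboard scores into one suite score."""
    required = {"multimodal", "parametric", "search", "grounding"}
    missing = required - sub_scores.keys()
    if missing:
        raise ValueError(f"missing sub-leaderboard scores: {missing}")
    return sum(sub_scores[k] for k in required) / len(required)

example = {"multimodal": 0.81, "parametric": 0.74, "search": 0.88, "grounding": 0.92}
print(round(suite_score(example), 4))  # 0.8375
```

An unweighted mean keeps no single scenario dominant, which matches the stated goal of a "robust and balanced" overall factuality measure.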
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLM factuality across multimodal, parametric, search, and grounding scenarios.
Measures factual accuracy in image-based questions and closed-book knowledge retrieval.
Assesses long-form response grounding in documents and information-seeking with search.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online leaderboard suite with four sub-leaderboards for factuality.
Automated judge models score responses across multimodal and search scenarios.
Public-private splits maintain integrity while enabling external participation.
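The bullets above note that automated judge models score each response. A hypothetical sketch of how per-response judge verdicts could be rolled up into a sub-leaderboard score; the verdict labels and function name are assumptions, not details from the paper:

```python
# Hypothetical aggregation of automated-judge verdicts into a sub-leaderboard
# score. Verdict labels ("accurate"/"inaccurate") are assumed for illustration.

from collections import Counter

def aggregate_verdicts(verdicts: list[str]) -> float:
    """Score a sub-leaderboard as the fraction of responses judged 'accurate'."""
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    counts = Counter(verdicts)
    return counts["accurate"] / len(verdicts)

print(aggregate_verdicts(["accurate", "accurate", "inaccurate", "accurate"]))  # 0.75
```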