The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input

📅 2025-01-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses hallucination in large language models (LLMs) in the setting of long-document understanding (contexts up to 32k tokens) and factually grounded generation. Methodologically, it introduces the first factuality-grounding evaluation benchmark designed specifically for long-context settings. It proposes a two-stage automated adjudication framework: responses are first checked for whether they fulfill the user request, and then multiple judge models assess whether each response is fully grounded in the provided context. To mitigate judge bias, scores are aggregated across judge models, and the data is partitioned into public and private splits to preserve leaderboard integrity and support sustained evolution. Technically, the evaluation combines multi-template prompt robustness checks, long-context processing, and fine-grained grounding verification. The benchmark is open-sourced with an actively maintained leaderboard on Kaggle, improving the reproducibility and practical utility of LLM factuality assessment.

📝 Abstract
We introduce FACTS Grounding, an online leaderboard and associated benchmark that evaluates language models' ability to generate text that is factually accurate with respect to given context in the user prompt. In our benchmark, each prompt includes a user request and a full document, with a maximum length of 32k tokens, requiring long-form responses. The long-form responses are required to be fully grounded in the provided context document while fulfilling the user request. Models are evaluated using automated judge models in two phases: (1) responses are disqualified if they do not fulfill the user request; (2) they are judged as accurate if the response is fully grounded in the provided document. The automated judge models were comprehensively evaluated against a held-out test-set to pick the best prompt template, and the final factuality score is an aggregate of multiple judge models to mitigate evaluation bias. The FACTS Grounding leaderboard will be actively maintained over time, and contains both public and private splits to allow for external participation while guarding the integrity of the leaderboard. It can be found at https://www.kaggle.com/facts-leaderboard.
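The two-phase evaluation and cross-judge aggregation described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the leaderboard's actual implementation: `fulfills_request` and the judge callables stand in for LLM judge-model queries, and the aggregation shown (mean accuracy over judges) is an assumption about the general shape of the scoring, not the benchmark's exact formula.

```python
from statistics import mean

def fulfills_request(response: str, request: str) -> bool:
    # Phase 1 placeholder: the real benchmark uses an LLM judge to check
    # request fulfillment; here we only disqualify empty responses.
    return bool(response.strip())

def factuality_score(examples, judges) -> float:
    # examples: (request, document, response) triples
    # judges: callables (response, document) -> bool grounding verdict,
    #         standing in for multiple LLM judge models
    per_judge_scores = []
    for judge in judges:
        accurate = 0
        for request, document, response in examples:
            if not fulfills_request(response, request):
                continue  # phase 1: disqualified, request not fulfilled
            if judge(response, document):
                accurate += 1  # phase 2: fully grounded per this judge
        per_judge_scores.append(accurate / len(examples))
    # Aggregate across judge models to mitigate single-judge bias
    return mean(per_judge_scores)
```

A disqualified response counts against the score under every judge, which mirrors the benchmark's design of treating unfulfilled requests as ineligible rather than simply unjudged.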
Problem


Large Language Models
Long Text Comprehension
Writing Capability
Innovation


FACTS Grounding
Large Language Model Assessment
Long Text Understanding and Generation
👥 Authors

Alon Jacovi, Google Research (Machine Learning, Natural Language Processing, Explainable Artificial Intelligence)
Andrew Wang, University of Toronto, Vector Institute (AI Safety)
Chris Alberti, Google
Connie Tao, Google DeepMind
Jon Lipovetz, Software Engineer, Google
Kate Olszewska, Google DeepMind
Lukas Haas, Google DeepMind
Michelle Liu, Google Cloud
Nate Keating, Kaggle
Adam Bloniarz, Google (Causal Inference, Machine Learning)
Carl Saroufim, Google Cloud
Corey Fry, Google Research
Doron Kukliansky, Google Research
Gaurav Singh Tomar, Google DeepMind (Natural Language Processing, Language Technologies and Artificial Intelligence in Education)
James Swirhun, Google Cloud
Jinwei Xing, Google Cloud
Lily Wang, George Mason University (Statistical Machine Learning, Nonparametric Methods, Complex Data, Big Data Analytics)
Madhu Gurumurthy, Google Cloud
Michael Aaron, Kaggle
Moran Ambar, Google Research
Rachana Fellinger, Google Research
Rui Wang, Google Cloud
Zizhao Zhang, Google Cloud
S. Goldshtein, Google Research
Dipanjan Das, Google DeepMind