The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input

📅 2025-01-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses hallucination in large language models (LLMs) in the setting of long-document understanding (contexts up to 32k tokens) and factually grounded generation. Methodologically, it introduces the first factuality-grounding evaluation benchmark designed specifically for long-context settings. It proposes a two-stage automated adjudication framework: responses are first checked for whether they fulfill the user request, and then multiple judge models assess whether each response is fully grounded in the provided context. To mitigate judge bias, scores are aggregated across judge models, and the data is partitioned into public and private splits to preserve leaderboard integrity and support sustained evolution. Technically, the evaluation combines multi-template prompt robustness checks, long-context processing, and fine-grained grounding verification. The benchmark is open-sourced with an actively maintained leaderboard on Kaggle, improving the reproducibility and practical utility of LLM factuality assessment.

📝 Abstract
We introduce FACTS Grounding, an online leaderboard and associated benchmark that evaluates language models' ability to generate text that is factually accurate with respect to given context in the user prompt. In our benchmark, each prompt includes a user request and a full document, with a maximum length of 32k tokens, requiring long-form responses. The long-form responses are required to be fully grounded in the provided context document while fulfilling the user request. Models are evaluated using automated judge models in two phases: (1) responses are disqualified if they do not fulfill the user request; (2) they are judged as accurate if the response is fully grounded in the provided document. The automated judge models were comprehensively evaluated against a held-out test-set to pick the best prompt template, and the final factuality score is an aggregate of multiple judge models to mitigate evaluation bias. The FACTS Grounding leaderboard will be actively maintained over time, and contains both public and private splits to allow for external participation while guarding the integrity of the leaderboard. It can be found at https://www.kaggle.com/facts-leaderboard.
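The two-phase evaluation and cross-judge aggregation described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the leaderboard's actual implementation: `fulfills_request` and the judge callables stand in for LLM judge-model queries, and the aggregation shown (mean accuracy over judges) is an assumption about the general shape of the scoring, not the benchmark's exact formula.

```python
from statistics import mean

def fulfills_request(response: str, request: str) -> bool:
    # Phase 1 placeholder: the real benchmark uses an LLM judge to check
    # request fulfillment; here we only disqualify empty responses.
    return bool(response.strip())

def factuality_score(examples, judges) -> float:
    # examples: (request, document, response) triples
    # judges: callables (response, document) -> bool grounding verdict,
    #         standing in for multiple LLM judge models
    per_judge_scores = []
    for judge in judges:
        accurate = 0
        for request, document, response in examples:
            if not fulfills_request(response, request):
                continue  # phase 1: disqualified, request not fulfilled
            if judge(response, document):
                accurate += 1  # phase 2: fully grounded per this judge
        per_judge_scores.append(accurate / len(examples))
    # Aggregate across judge models to mitigate single-judge bias
    return mean(per_judge_scores)
```

A disqualified response counts against the score under every judge, which mirrors the benchmark's design of treating unfulfilled requests as ineligible rather than simply unjudged.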
Problem


Large Language Models
Long Text Comprehension
Writing Capability
Innovation


FACTS Grounding
Large Language Model Assessment
Long Text Understanding and Generation
👥 Authors

Alon Jacovi, Google Research (Machine Learning, Natural Language Processing, Explainable Artificial Intelligence)
Andrew Wang, University of Toronto, Vector Institute (AI Safety)
Chris Alberti, Google
Connie Tao, Google DeepMind
Jon Lipovetz, Software Engineer, Google
Kate Olszewska, Google DeepMind
Lukas Haas, Google DeepMind
Michelle Liu, Google Cloud
Nate Keating, Kaggle
Adam Bloniarz, Google (Causal Inference, Machine Learning)
Carl Saroufim, Google Cloud
Corey Fry, Google Research
Doron Kukliansky, Google Research
Gaurav Singh Tomar, Google DeepMind (Natural Language Processing, Language Technologies and Artificial Intelligence in Education)
James Swirhun, Google Cloud
Jinwei Xing, Google Cloud
Lily Wang, George Mason University (Statistical Machine Learning, Nonparametric Methods, Complex Data, Big Data Analytics)
Madhu Gurumurthy, Google Cloud
Michael Aaron, Kaggle
Moran Ambar, Google Research
Rachana Fellinger, Google Research
Rui Wang, Google Cloud
Zizhao Zhang, Google Cloud
S. Goldshtein, Google Research
Dipanjan Das, Google DeepMind