🤖 AI Summary
In open-book question answering, existing evaluation methods suffer from bias, poor scalability, and reliance on external systems, which hinders accurate measurement of how much a model's answer depends on the provided context. To address this, we propose ConSens, a metric that quantifies a model's context anchoring by measuring the relative perplexity difference between context-aware and context-agnostic generations, using the model itself as both generator and evaluator. ConSens requires no fine-tuning, eliminates dependence on external judge models, and supports zero-shot, cross-model evaluation with high interpretability. Experiments across multiple datasets show that ConSens correlates strongly with human judgments (Spearman's ρ > 0.85), outperforms baselines such as LLM-as-a-judge in discriminative power, and reduces computational overhead by over 90%.
📝 Abstract
Large Language Models (LLMs) have demonstrated considerable success in open-book question answering (QA), where the task requires generating answers grounded in a provided external context. A critical challenge in open-book QA is to ensure that model responses are based on the provided context rather than on the model's parametric knowledge, which can be outdated, incomplete, or incorrect. Existing evaluation methods, primarily based on the LLM-as-a-judge approach, face significant limitations, including biases, scalability issues, and dependence on costly external systems. To address these challenges, we propose a novel metric that contrasts the perplexity of the model response under two conditions: when the context is provided and when it is not. The resulting score quantifies the extent to which the model's answer relies on the provided context. The validity of this metric is demonstrated through a series of experiments that show its effectiveness in identifying whether a given answer is grounded in the provided context. Unlike existing approaches, this metric is computationally efficient, interpretable, and adaptable to various use cases, offering a scalable and practical solution to assess context utilization in open-book QA systems.
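The core idea of contrasting perplexity under the two conditions can be sketched in a few lines. The exact ConSens formula is not given in the abstract, so the normalized perplexity difference below (and the helper names) are illustrative assumptions; in practice the token log-probabilities would come from scoring the same answer with a causal LM, once with the context prepended and once without.

```python
import math

def perplexity(token_logprobs):
    # Perplexity is the exponential of the average negative
    # log-likelihood per token of the answer.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def context_sensitivity(logprobs_with_ctx, logprobs_without_ctx):
    # Illustrative score in (-1, 1): positive when the answer is more
    # likely (lower perplexity) given the context than without it,
    # i.e. the answer appears grounded in the context. The actual
    # ConSens formulation may differ.
    ppl_ctx = perplexity(logprobs_with_ctx)
    ppl_no_ctx = perplexity(logprobs_without_ctx)
    return (ppl_no_ctx - ppl_ctx) / (ppl_no_ctx + ppl_ctx)

# Hypothetical log-probabilities of the same answer tokens, scored by
# the model with and without the supporting context in the prompt.
with_ctx = [-0.1, -0.2, -0.15]    # confident when the context is shown
without_ctx = [-2.0, -1.5, -2.5]  # much less likely without it
score = context_sensitivity(with_ctx, without_ctx)
```

A score near 1 suggests the answer is strongly anchored in the context, while a score near 0 (or negative) suggests the model would have produced it from parametric knowledge alone.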