🤖 AI Summary
In open-book question answering, existing evaluation methods suffer from bias, poor scalability, and reliance on external systems, which hinders accurate measurement of how much a model's answer depends on the provided context. To address this, we propose ConSens, a metric that quantifies a model's context anchoring by measuring the relative perplexity difference between context-aware and context-agnostic generations, using the model itself as both generator and evaluator. ConSens requires no fine-tuning, eliminates dependence on external judge models, and supports zero-shot, cross-model evaluation with high interpretability. Experiments across multiple datasets show that ConSens correlates strongly with human judgments (Spearman's ρ > 0.85), outperforms baselines such as LLM-as-a-judge in discriminative power, and reduces computational overhead by over 90%.
📝 Abstract
Large Language Models (LLMs) have demonstrated considerable success in open-book question answering (QA), where the task requires generating answers grounded in a provided external context. A critical challenge in open-book QA is to ensure that model responses are based on the provided context rather than on the model's parametric knowledge, which can be outdated, incomplete, or incorrect. Existing evaluation methods, primarily based on the LLM-as-a-judge approach, face significant limitations, including biases, scalability issues, and dependence on costly external systems. To address these challenges, we propose a novel metric that contrasts the perplexity of the model response under two conditions: when the context is provided and when it is not. The resulting score quantifies the extent to which the model's answer relies on the provided context. The validity of this metric is demonstrated through a series of experiments that show its effectiveness in identifying whether a given answer is grounded in the provided context. Unlike existing approaches, this metric is computationally efficient, interpretable, and adaptable to various use cases, offering a scalable and practical solution to assess context utilization in open-book QA systems.
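The core idea of contrasting perplexity under the two conditions can be sketched in a few lines. The exact ConSens formula is not given in the abstract, so the normalized perplexity difference below (and the helper names) are illustrative assumptions; in practice the token log-probabilities would come from scoring the same answer with a causal LM, once with the context prepended and once without.

```python
import math

def perplexity(token_logprobs):
    # Perplexity is the exponential of the average negative
    # log-likelihood per token of the answer.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def context_sensitivity(logprobs_with_ctx, logprobs_without_ctx):
    # Illustrative score in (-1, 1): positive when the answer is more
    # likely (lower perplexity) given the context than without it,
    # i.e. the answer appears grounded in the context. The actual
    # ConSens formulation may differ.
    ppl_ctx = perplexity(logprobs_with_ctx)
    ppl_no_ctx = perplexity(logprobs_without_ctx)
    return (ppl_no_ctx - ppl_ctx) / (ppl_no_ctx + ppl_ctx)

# Hypothetical log-probabilities of the same answer tokens, scored by
# the model with and without the supporting context in the prompt.
with_ctx = [-0.1, -0.2, -0.15]    # confident when the context is shown
without_ctx = [-2.0, -1.5, -2.5]  # much less likely without it
score = context_sensitivity(with_ctx, without_ctx)
```

A score near 1 suggests the answer is strongly anchored in the context, while a score near 0 (or negative) suggests the model would have produced it from parametric knowledge alone.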