AI Summary
Language models (LMs) often suffer from outdated parametric knowledge or interference from irrelevant context in knowledge-intensive tasks, yet existing context utilisation manipulation techniques (CMTs) lack systematic evaluation. Method: We introduce CUB, the first benchmark for evaluating context utilisation in retrieval-augmented generation (RAG), featuring a multi-dimensional taxonomy of context types (ground-truth, synthetic, and distracting) and a standardised evaluation protocol spanning nine mainstream LMs and three knowledge-intensive task categories. Contribution/Results: Our systematic assessment of seven representative CMTs reveals severe generalisation deficits: average performance drops by 23% on natural data, while synthetic contexts induce misleadingly inflated scores. By empirically identifying these robustness bottlenecks in CMTs, CUB establishes a foundational evaluation infrastructure for trustworthy RAG systems.
Abstract
Incorporating external knowledge is crucial for knowledge-intensive tasks, such as question answering and fact checking. However, language models (LMs) may ignore relevant information that contradicts outdated parametric memory, or be distracted by irrelevant contexts. While many context utilisation manipulation techniques (CMTs), which encourage or suppress context utilisation, have recently been proposed to alleviate these issues, few have been systematically compared. In this paper, we develop CUB (Context Utilisation Benchmark) to help practitioners within retrieval-augmented generation (RAG) identify the best CMT for their needs. CUB enables rigorous testing on three distinct context types, observed to capture key challenges in realistic context utilisation scenarios. With this benchmark, we evaluate seven state-of-the-art methods, representative of the main categories of CMTs, across three diverse datasets and tasks, applied to nine LMs. Our results show that most existing CMTs struggle to handle the full range of context types that may be encountered in real-world retrieval-augmented scenarios. Moreover, we find that many CMTs display inflated performance on simple synthesised datasets compared to more realistic datasets with naturally occurring samples. Altogether, our results highlight the need for holistic testing of CMTs and for the development of CMTs that can handle multiple context types.
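The benchmarking setup described above can be pictured as a loop that scores a model separately on each context type (ground-truth, synthetic, distracting), so that a CMT's robustness across types becomes visible. The sketch below is a minimal illustration under assumed names and structure, not CUB's actual interface or data format:

```python
# Hypothetical sketch of a CUB-style per-context-type evaluation loop.
# All names (Sample, build_prompt, evaluate) are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Sample:
    question: str
    context: str
    context_type: str  # e.g. "ground_truth", "synthetic", or "distracting"
    answer: str

def build_prompt(sample: Sample) -> str:
    """Prepend the retrieved context to the question, as in standard RAG."""
    return f"Context: {sample.context}\nQuestion: {sample.question}\nAnswer:"

def evaluate(model, samples):
    """Return accuracy bucketed by context type, so that performance on
    ground-truth vs. synthetic vs. distracting contexts can be compared."""
    scores: dict[str, list[int]] = {}
    for s in samples:
        pred = model(build_prompt(s))
        # Simple containment match stands in for a real answer-matching metric.
        scores.setdefault(s.context_type, []).append(
            int(s.answer.lower() in pred.lower())
        )
    return {ctype: sum(v) / len(v) for ctype, v in scores.items()}

# Toy "model" that just echoes its prompt, i.e. follows the context blindly:
# it looks right under a ground-truth context and wrong under a distracting one.
samples = [
    Sample("Who wrote Hamlet?", "Hamlet was written by Shakespeare.",
           "ground_truth", "Shakespeare"),
    Sample("Who wrote Hamlet?", "Hamlet was written by Marlowe.",
           "distracting", "Shakespeare"),
]
acc = evaluate(lambda prompt: prompt, samples)
```

Reporting a single aggregate score would hide exactly the failure mode the abstract highlights: the echo "model" looks perfect on ground-truth contexts while failing completely on distracting ones, which is why the per-type breakdown matters.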