ETHIC: Evaluating Large Language Models on Long-Context Tasks with High Information Coverage

📅 2024-10-22
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
Existing long-context evaluation methods, such as the needle-in-a-haystack test, assess only local information retrieval and thus fail to measure whether models leverage the full context, compromising evaluation reliability. To address this, the authors propose information coverage (IC), a metric defined as the proportion of the input context necessary for answering a query, and show that current benchmarks exhibit low IC. Building on this, they present ETHIC, a benchmark of 1,986 test instances spanning four high-IC long-context tasks in the domains of books, debates, medicine, and law. Evaluations reveal substantial performance degradation across contemporary LLMs on these high-IC tasks, underscoring their limited capacity for holistic context utilization. The benchmark is publicly released to support more rigorous evaluation of long-context modeling capabilities.

📝 Abstract
Recent advancements in large language models (LLMs) capable of processing extremely long texts highlight the need for a dedicated evaluation benchmark to assess their long-context capabilities. However, existing methods, like the needle-in-a-haystack test, do not effectively assess whether these models fully utilize contextual information, raising concerns about the reliability of current evaluation techniques. To thoroughly examine the effectiveness of existing benchmarks, we introduce a new metric called information coverage (IC), which quantifies the proportion of the input context necessary for answering queries. Our findings indicate that current benchmarks exhibit low IC; although the input context may be extensive, the actual usable context is often limited. To address this, we present ETHIC, a novel benchmark designed to assess LLMs' ability to leverage the entire context. Our benchmark comprises 1,986 test instances spanning four long-context tasks with high IC scores in the domains of books, debates, medicine, and law. Our evaluations reveal significant performance drops in contemporary LLMs, highlighting a critical challenge in managing long contexts. Our benchmark is available at https://github.com/dmis-lab/ETHIC.
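The abstract defines IC as the proportion of the input context necessary for answering a query. The paper's exact formula is not given on this page, but one plausible reading can be sketched as the fraction of context tokens covered by the answer-relevant spans. The function name and the span-based representation below are illustrative assumptions, not the authors' implementation:

```python
def information_coverage(context_len: int, necessary_spans: list[tuple[int, int]]) -> float:
    """Sketch of an IC-style score (assumed formulation, not the paper's code).

    context_len:     total number of tokens in the input context.
    necessary_spans: (start, end) token offsets of the context regions
                     required to answer the query; spans may overlap.
    Returns the fraction of context tokens that are necessary.
    """
    covered: set[int] = set()
    for start, end in necessary_spans:
        covered.update(range(start, end))  # union handles overlapping spans
    return len(covered) / context_len

# Under this reading, a needle-in-a-haystack item has IC near 0
# (one tiny span in a huge context), while ETHIC targets tasks
# where the necessary spans cover most of the document.
```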
Problem

Research questions and friction points this paper is trying to address.

Evaluate LLMs on long-context tasks
Assess information coverage in benchmarks
Introduce ETHIC for comprehensive context utilization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces information coverage (IC) metric
Develops ETHIC benchmark for LLMs
Tests LLMs on high IC tasks
Taewhoo Lee (Korea University)
Chanwoong Yoon (Korea University)
Kyochul Jang (Seoul National University)
Donghyeon Lee (Korea University)
Minju Song (Korea University)
Hyunjae Kim (Yale University)
Jaewoo Kang (Korea University, AIGEN Sciences)