Beyond the Needle's Illusion: Decoupled Evaluation of Evidence Access and Use under Semantic Interference at 326M-Token Scale

📅 2026-01-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the overestimation of long-context model performance in existing benchmarks like NIAH, which lack semantic interference and thus fail to assess a model’s ability to accurately retrieve and utilize evidence in complex scenarios. To this end, we propose EverMemBench-S, an adversarial benchmark built upon a 326M-token MemoryBank that introduces semantically similar hard negatives and multi-document evidence sets, enabling the first decoupled evaluation of evidence localization and answer quality. The benchmark features adversarially designed query–evidence pairs, validated through both human curation and LLM-based verification, and supports unified evaluation of both native long-context models and RAG systems. Experiments reveal that while models excel on NIAH, their evidence retrieval capability degrades significantly under semantic interference, highlighting semantic discrimination as a core bottleneck in long-context memory.

📝 Abstract
Long-context LLM agents must access the right evidence from large environments and use it faithfully. However, the popular Needle-in-a-Haystack (NIAH) evaluation mostly measures benign span localization. The needle is near-unique, and the haystack is largely irrelevant. We introduce EverMemBench-S (EMB-S), an adversarial NIAH-style benchmark built on a 326M-token MemoryBank. While the full MemoryBank spans 326M tokens for retrieval-based (RAG) evaluation, we evaluate native long-context models only at scales that fit within each model's context window (up to 1M tokens in this work) to ensure a fair comparison. EMB-S pairs queries with collision-tested near-miss hard negatives and gold evidence sets spanning one or more documents, validated via human screening and LLM verification. We also propose a decoupled diagnostic protocol that reports evidence access (document-ID localization) separately from end-to-end QA quality under full-context prompting. This enables consistent diagnosis for both native long-context prompting and retrieval pipelines. Across a reference-corpus ladder from domain-isolated 64K contexts to a globally shared 326M-token environment, we observe a clear reality gap. Systems that saturate benign NIAH degrade sharply in evidence access under semantic interference. These results indicate that semantic discrimination, not context length alone, is the dominant bottleneck for long-context memory at scale.
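The decoupled protocol described above can be sketched as two independent metrics: evidence access as document-ID recall against the gold evidence set, and end-to-end QA quality scored separately. This is a minimal illustrative sketch, not the paper's implementation; the field names (`gold_docs`, `cited_docs`) and the exact-match QA scorer are assumptions (the paper uses LLM-based verification for answer quality).

```python
def evidence_access_recall(gold_doc_ids, predicted_doc_ids):
    """Fraction of gold evidence documents the system localized."""
    gold = set(gold_doc_ids)
    if not gold:
        return 0.0
    return len(gold & set(predicted_doc_ids)) / len(gold)


def answer_quality(prediction, reference):
    """Toy end-to-end QA score (exact match); a judge model could stand in here."""
    return float(prediction.strip().lower() == reference.strip().lower())


def decoupled_eval(examples):
    """Report the two axes separately instead of one blended score,
    so retrieval failures and generation failures stay distinguishable."""
    access = [evidence_access_recall(ex["gold_docs"], ex["cited_docs"])
              for ex in examples]
    quality = [answer_quality(ex["answer"], ex["reference"])
               for ex in examples]
    n = len(examples)
    return {"evidence_access": sum(access) / n, "qa_quality": sum(quality) / n}
```

Reporting the two numbers side by side is what exposes the "reality gap": a system can keep QA quality respectable while its evidence-access score collapses under semantic interference.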
Problem

Research questions and friction points this paper is trying to address.

long-context LLM
semantic interference
evidence access
Needle-in-a-Haystack
retrieval evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic interference
decoupled evaluation
long-context LLM
adversarial benchmark
evidence access