AbsenceBench: Language Models Can't Tell What's Missing

📅 2025-06-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) excel at long-context understanding and information localization but lack “absence awareness”—the ability to detect explicitly missing content in documents. This work introduces AbsenceBench, the first systematic benchmark for evaluating LLMs’ capacity to identify deliberately omitted information across three domains: numeric sequences, poetry, and GitHub pull requests. Methodologically, the benchmark uses a contrastive setup that gives models both the original and the edited document, evaluated via F1 score (with supporting attention-attribution analysis) in zero-shot settings across leading closed- and open-source models. Results reveal severe limitations: even the strongest model, Claude-3.7-Sonnet, achieves only 69.6% average F1 at a modest average context length of 5K tokens—substantially below performance on existence-detection tasks such as Needle in a Haystack (NIAH). This suggests that Transformer attention mechanisms struggle to represent “gaps,” since absences correspond to no explicit tokens that can be attended to. The work establishes absence awareness as a distinct capability dimension and fills a critical gap in information-integrity evaluation.

📝 Abstract
Large language models (LLMs) are increasingly capable of processing long inputs and locating specific information within them, as evidenced by their performance on the Needle in a Haystack (NIAH) test. However, while models excel at recalling surprising information, they still struggle to identify clearly omitted information. We introduce AbsenceBench to assess LLMs' capacity to detect missing information across three domains: numerical sequences, poetry, and GitHub pull requests. AbsenceBench asks models to identify which pieces of a document were deliberately removed, given access to both the original and edited contexts. Despite the apparent straightforwardness of these tasks, our experiments reveal that even state-of-the-art models like Claude-3.7-Sonnet achieve only 69.6% F1-score with a modest average context length of 5K tokens. Our analysis suggests this poor performance stems from a fundamental limitation: Transformer attention mechanisms cannot easily attend to "gaps" in documents since these absences don't correspond to any specific keys that can be attended to. Overall, our results and analysis provide a case study of the close proximity of tasks where models are already superhuman (NIAH) and tasks where models break down unexpectedly (AbsenceBench).
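The task setup described above—remove items from a document, give the model both versions, and score the predicted missing set with F1—can be illustrated with a small sketch. This is an assumed, simplified generator for the numerical-sequence domain, not the authors' exact pipeline; the function names (`make_absence_task`, `f1`) are hypothetical.

```python
import random

def make_absence_task(seq_len=20, n_omit=3, seed=0):
    """Build an AbsenceBench-style example: an original numeric sequence,
    an edited copy with a few items deliberately removed, and the gold
    set of omitted items the model must recover."""
    rng = random.Random(seed)
    original = list(range(seq_len))
    omitted = sorted(rng.sample(original, n_omit))
    edited = [x for x in original if x not in omitted]
    return original, edited, omitted

def f1(predicted, gold):
    """Set-level F1 between the model's predicted missing items and the
    true omissions, the scoring metric mentioned in the abstract."""
    pred, gold = set(predicted), set(gold)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

original, edited, omitted = make_absence_task()
# A perfect model recovers exactly the omitted items and scores 1.0;
# predicting nothing (or only wrong items) scores 0.0.
assert f1(omitted, omitted) == 1.0
```

Note that the model is given both `original` and `edited` in context, so the task is in principle a pure comparison—which is what makes the reported 69.6% F1 surprising.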
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' ability to detect missing information
Evaluating performance on identifying deliberately removed content
Analyzing Transformer limitations in attending to document gaps
Innovation

Methods, ideas, or system contributions that make the work stand out.

AbsenceBench tests LLMs on missing information detection
Transformer attention struggles with document gaps
Models fail despite original and edited context access