🤖 AI Summary
Large language models (LLMs) excel at long-context understanding and information localization but lack "absence awareness": the ability to detect deliberately omitted content in documents. This work introduces AbsenceBench, the first systematic benchmark for evaluating LLMs' capacity to identify deliberately removed information across three domains: numeric sequences, poetry, and GitHub pull requests. Models are given both the original document and an edited version with content removed, and must name the missing pieces; performance is measured by F1 score in zero-shot settings across leading closed- and open-source models. Results reveal severe limitations: even the strongest model tested, Claude-3.7-Sonnet, achieves only 69.6% average F1 at a modest average context length of 5K tokens, far below performance on existence-detection tasks such as NIAH. The analysis attributes this gap to a structural limitation of Transformer attention: omitted content corresponds to no tokens, so there are no keys for the model to attend to. This work establishes absence awareness as a distinct capability dimension and fills a gap in information-integrity evaluation.
📝 Abstract
Large language models (LLMs) are increasingly capable of processing long inputs and locating specific information within them, as evidenced by their performance on the Needle in a Haystack (NIAH) test. However, while models excel at recalling surprising information, they still struggle to identify clearly omitted information. We introduce AbsenceBench to assess LLMs' capacity to detect missing information across three domains: numerical sequences, poetry, and GitHub pull requests. AbsenceBench asks models to identify which pieces of a document were deliberately removed, given access to both the original and edited contexts. Despite the apparent straightforwardness of these tasks, our experiments reveal that even state-of-the-art models like Claude-3.7-Sonnet achieve only a 69.6% F1-score with a modest average context length of 5K tokens. Our analysis suggests this poor performance stems from a fundamental limitation: Transformer attention mechanisms cannot easily attend to "gaps" in documents, since these absences don't correspond to any specific keys that can be attended to. Overall, our results and analysis provide a case study of the close proximity of tasks where models are already superhuman (NIAH) and tasks where models break down unexpectedly (AbsenceBench).
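The task setup described above (delete pieces of a document, then ask the model to name what is missing, scored by F1) can be sketched in a few lines. This is a minimal illustration of the general recipe, not the paper's actual data pipeline; the function names and the line-level deletion granularity are assumptions for the example.

```python
import random

def make_instance(original_lines, k=2, seed=0):
    """Build an AbsenceBench-style instance: delete k lines at random.

    Returns the edited document (with lines removed) and the gold set
    of removed lines the model is expected to recover.
    """
    rng = random.Random(seed)
    removed = set(rng.sample(range(len(original_lines)), k))
    edited = [line for i, line in enumerate(original_lines) if i not in removed]
    gold = [original_lines[i] for i in sorted(removed)]
    return edited, gold

def f1_score(predicted, gold):
    """Set-level F1 between predicted and gold omitted lines."""
    p, g = set(predicted), set(gold)
    tp = len(p & g)
    if tp == 0:
        return 0.0
    precision = tp / len(p)
    recall = tp / len(g)
    return 2 * precision * recall / (precision + recall)
```

A model would receive both `original_lines` and `edited`, produce a list of lines it believes were removed, and be scored with `f1_score` against `gold`. The difficulty the abstract points to is that nothing in `edited` marks where a deletion occurred, so the "gap" has no token for attention to latch onto.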