🤖 AI Summary
This paper identifies a critical vulnerability in current search-augmented language models: their factual accuracy and reasoning robustness degrade substantially when web search returns conflicting, noisy, or unhelpful results. Even frontier agentic models with tool access perform poorly, with o3 reaching only 17.1% accuracy on Seal-0, and strong reasoning models such as DeepSeek-R1-671B and o3-mini are easily misled by noisy search results. To evaluate retrieval-augmented reasoning under these conditions, the authors introduce SealQA, a benchmark in three flavors: Seal-0, the main set of the most challenging questions, on which chat models like GPT-4.1 typically achieve near-zero accuracy; Seal-Hard, a broader set of difficult questions assessing factual accuracy and reasoning; and LongSeal, which tests long-context, multi-document reasoning in "needle-in-a-haystack" settings. The evaluation further shows that increasing test-time compute does not yield reliable gains, with performance often plateauing or declining early, and that models still struggle to identify relevant documents among numerous distractors. The benchmark is publicly released on Hugging Face.
📝 Abstract
We introduce SealQA, a new challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results. SealQA comes in three flavors: (1) Seal-0 (main) and (2) Seal-Hard, which assess factual accuracy and reasoning capabilities, with Seal-0 focusing on the most challenging questions where chat models (e.g., GPT-4.1) typically achieve near-zero accuracy; and (3) LongSeal, which extends SealQA to test long-context, multi-document reasoning in "needle-in-a-haystack" settings. Our evaluation reveals critical limitations in current models: Even frontier LLMs perform poorly across all SealQA flavors. On Seal-0, frontier agentic models equipped with tools like o3 and o4-mini achieve only 17.1% and 6.3% accuracy, respectively, at their best reasoning efforts. We find that advanced reasoning models such as DeepSeek-R1-671B and o3-mini are highly vulnerable to noisy search results. Notably, increasing test-time compute does not yield reliable gains across o3-mini, o4-mini, and o3, with performance often plateauing or even declining early. Additionally, while recent models are less affected by the "lost-in-the-middle" issue, they still fail to reliably identify relevant documents in LongSeal when faced with numerous distractors. To facilitate future work, we release SealQA at huggingface.co/datasets/vtllms/sealqa.