Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Needle-in-a-Haystack (NIAH) benchmarks evaluate only idealized long-context retrieval, neglecting real-world noise arising from heterogeneous, biased retrieval and cascading agent-level errors. Method: We propose HaystackCraft, the first benchmark for noisy long-context evaluation—termed “Haystack Engineering”—that jointly models retrieval biases (sparse, dense, hybrid, and graph-structured) and dynamic agent behaviors (query optimization, reflective reasoning, and stopping decisions). It constructs multi-hop QA and self-reflective reasoning tasks grounded in Wikipedia’s hyperlink network. Contribution/Results: Experiments reveal that mainstream LLMs suffer from cascading failures induced by self-generated distractors and struggle to terminate early. Crucially, graph-structured re-ranking simultaneously improves retrieval quality and suppresses harmful noise, demonstrating its efficacy in mitigating interference under realistic conditions.

📝 Abstract
Modern long-context large language models (LLMs) perform well on synthetic "needle-in-a-haystack" (NIAH) benchmarks, but such tests overlook how noisy contexts arise from biased retrieval and agentic workflows. We argue that haystack engineering is necessary to construct noisy long contexts that faithfully capture key real-world factors -- distraction from heterogeneous biased retrievers and cascading errors in agentic workflows -- to test models' long-context robustness. We instantiate it through HaystackCraft, a new NIAH benchmark built on the full English Wikipedia hyperlink network with multi-hop questions. HaystackCraft evaluates how heterogeneous retrieval strategies (e.g., sparse, dense, hybrid, and graph-based) affect distractor composition, haystack ordering, and downstream LLM performance. HaystackCraft further extends NIAH to dynamic, LLM-dependent settings that simulate agentic operations, where models refine queries, reflect on their past reasonings, and decide when to stop. Experiments with 15 long-context models show that (1) while stronger dense retrievers can introduce more challenging distractors, graph-based reranking simultaneously improves retrieval effectiveness and mitigates more harmful distractors; (2) in agentic tests, even advanced models like Gemini 2.5 Pro and GPT-5 suffer cascading failures from self-generated distractors or struggle to perform early stops. These results highlight persistent challenges in agentic long-context reasoning and establish HaystackCraft as a valuable testbed for future progress.
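The abstract contrasts sparse, dense, and hybrid retrieval strategies. One common way to build a hybrid retriever is reciprocal rank fusion (RRF) over the two ranked lists; the sketch below is illustrative only, with made-up document IDs, and is not code from HaystackCraft:

```python
# Minimal sketch of hybrid retrieval via reciprocal rank fusion (RRF).
# The document IDs and rankings are hypothetical examples.

def rrf_fuse(rankings, k=60):
    """Combine ranked lists of doc ids; higher fused score = better."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Each list contributes 1/(k + rank); k damps the influence
            # of any single retriever's top ranks.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["d3", "d1", "d7"]   # e.g. a BM25 ordering
dense = ["d1", "d5", "d3"]    # e.g. an embedding-similarity ordering
fused = rrf_fuse([sparse, dense])
```

Because each retriever surfaces different candidates, the fused list also changes which distractors end up in the haystack, which is exactly the compositional effect the benchmark measures.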
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM robustness in noisy long-context environments with biased retrievers
Assessing cascading errors in agentic workflows involving dynamic reasoning
Testing model performance on multi-hop questions with heterogeneous distractors
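The agentic setting above can be pictured as a retrieve-reflect-stop loop in which every retrieved document stays in context, so a bad early hop becomes a self-generated distractor for later hops. A toy sketch under that assumption, with a rule-based stand-in for the model (all function names are ours, not the benchmark's):

```python
# Hypothetical sketch of an agentic NIAH loop: refine the query,
# accumulate retrieved context, and decide when to stop.

def agentic_answer(question, retrieve, answer_found, max_rounds=3):
    query, history = question, []
    for round_ in range(max_rounds):
        docs = retrieve(query)
        history.extend(docs)        # earlier hops stay in context, so a
                                    # bad hop becomes a distractor later
        if answer_found(history):   # the stopping decision
            return history, round_ + 1
        # Naive query refinement: fold the latest evidence into the query.
        query = question + " " + docs[-1]
    return history, max_rounds

# Toy two-hop corpus: the first hop points at a page whose contents
# make the second-hop query succeed.
corpus = {"who wrote X": ["doc:author-page"],
          "who wrote X doc:author-page": ["doc:answer"]}
history, rounds = agentic_answer(
    "who wrote X",
    retrieve=lambda q: corpus.get(q, ["doc:noise"]),
    answer_found=lambda h: "doc:answer" in h)
```

If the first hop had returned `doc:noise` instead, the refined query would drift and every later round would pile more noise into `history`: a miniature version of the cascading failures the paper reports.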
Innovation

Methods, ideas, or system contributions that make the work stand out.

Develops HaystackCraft benchmark using Wikipedia hyperlink network
Evaluates heterogeneous retrieval strategies and distractor composition
Simulates agentic workflows with dynamic query refinement
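One plausible reading of graph-based reranking over a hyperlink network is personalized PageRank seeded by the retrieved pages. The sketch below assumes that interpretation; the tiny graph, seed choice, and candidate list are invented, and HaystackCraft's actual reranker may differ:

```python
# Hypothetical sketch: personalized PageRank over a toy hyperlink
# graph, used to reorder an initial retriever's candidates.

def personalized_pagerank(links, seeds, damping=0.85, iters=50):
    nodes = set(links) | {v for vs in links.values() for v in vs} | set(seeds)
    # Teleport mass only to the seed pages (the "personalized" part).
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iters):
        new = {n: (1 - damping) * teleport[n] for n in nodes}
        for src, outs in links.items():
            if outs:
                share = damping * rank[src] / len(outs)
                for dst in outs:
                    new[dst] += share
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
candidates = ["D", "C", "B"]            # initial retriever order
pr = personalized_pagerank(links, seeds=["A"])
reranked = sorted(candidates, key=lambda n: pr[n], reverse=True)
```

Here page "D" is disconnected from the seed's neighborhood, so it falls to the bottom of the reranked list; in the benchmark's terms, graph structure demotes off-topic distractors while promoting pages linked to the evidence.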