GISA: A Benchmark for General Information-Seeking Assistant

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing information-retrieval agent benchmarks, which often rely on synthetically reverse-engineered queries and static answer sets, yielding unnatural tasks and narrow scenarios that fail to capture real-world complexity. To bridge this gap, the authors introduce GISA, a general-purpose benchmark comprising 373 human-authored information-seeking queries that require deep reasoning and broad information aggregation. GISA supports four structured output formats (item, set, list, and table) and uniquely pairs every query with a complete human search trajectory and a dynamic answer-validation mechanism, enabling both process-level evaluation and imitation learning. Evaluations of leading large language models and commercial search systems reveal a best exact-match accuracy of only 19.30%, with performance degrading most on tasks that demand complex planning and integrative information gathering, underscoring significant limitations of current systems.

📝 Abstract
The advancement of large language models (LLMs) has significantly accelerated the development of search agents capable of autonomously gathering information through multi-turn web interactions. Various benchmarks have been proposed to evaluate such agents. However, existing benchmarks often construct queries backward from answers, producing unnatural tasks misaligned with real-world needs. Moreover, these benchmarks tend to focus on either locating specific information or aggregating information from multiple sources, while relying on static answer sets prone to data contamination. To bridge these gaps, we introduce GISA, a benchmark for General Information-Seeking Assistants comprising 373 human-crafted queries that reflect authentic information-seeking scenarios. GISA features four structured answer formats (item, set, list, and table), enabling deterministic evaluation. It integrates both deep reasoning and broad information aggregation within unified tasks, and includes a live subset with periodically updated answers to resist memorization. Notably, GISA provides complete human search trajectories for every query, offering gold-standard references for process-level supervision and imitation learning. Experiments on mainstream LLMs and commercial search products reveal that even the best-performing model achieves only a 19.30% exact match score, with performance notably degrading on tasks requiring complex planning and comprehensive information gathering. These findings highlight substantial room for future improvement.
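The abstract's claim that structured answer formats enable deterministic evaluation can be illustrated with a small sketch. The function names and normalization rules below are illustrative assumptions, not GISA's actual scoring implementation: the idea is simply that each format (item, set, list, table) admits an unambiguous exact-match comparison once trivial formatting differences are normalized away.

```python
# Hypothetical sketch of deterministic exact-match scoring over four
# structured answer formats: item, set, list, and table. The normalization
# rules here are assumptions for illustration, not GISA's actual metric.

def normalize(value: str) -> str:
    """Case-fold and collapse whitespace so superficial formatting
    differences do not break an otherwise exact match."""
    return " ".join(value.strip().lower().split())

def exact_match(pred, gold, fmt: str) -> bool:
    if fmt == "item":    # a single answer string
        return normalize(pred) == normalize(gold)
    if fmt == "set":     # order-insensitive collection of strings
        return {normalize(x) for x in pred} == {normalize(x) for x in gold}
    if fmt == "list":    # order-sensitive sequence of strings
        return [normalize(x) for x in pred] == [normalize(x) for x in gold]
    if fmt == "table":   # rows of cells; row order treated as insensitive here
        def canon(rows):
            return sorted(tuple(normalize(c) for c in row) for row in rows)
        return canon(pred) == canon(gold)
    raise ValueError(f"unknown answer format: {fmt}")
```

A set answer matches regardless of element order, while a list answer does not; this is the kind of format-aware determinism that lets a benchmark report an exact-match score without LLM-based judging.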
Problem

Research questions and friction points this paper is trying to address.

information-seeking agents
benchmark evaluation
data contamination
realistic queries
search trajectories
Innovation

Methods, ideas, or system contributions that make the work stand out.

information-seeking agent
benchmark
structured evaluation
human search trajectories
live answer updates