Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction

📅 2026-05-03
📈 Citations: 0
✨ Influential: 0

📝 Abstract
Modern retrieval systems, whether lexical or semantic, expose a corpus through a fixed similarity interface that compresses access into a single top-k retrieval step before reasoning. This abstraction is efficient, but for agentic search it becomes a bottleneck: exact lexical constraints, sparse clue conjunctions, local context checks, and multi-step hypothesis refinement are difficult to implement by calling a conventional off-the-shelf retriever, and evidence filtered out early cannot be recovered by stronger downstream reasoning. Agentic tasks exacerbate this limitation because they require agents to orchestrate multiple steps, including discovering intermediate entities, combining weak clues, and revising the plan after observing partial evidence. To address this limitation, we study direct corpus interaction (DCI), where an agent searches the raw corpus directly with general-purpose terminal tools (e.g., grep, file reads, shell commands, lightweight scripts), without any embedding model, vector index, or retrieval API. This approach requires no offline indexing and adapts naturally to evolving local corpora. Across IR benchmarks and end-to-end agentic search tasks, this simple setup substantially outperforms strong sparse, dense, and reranking baselines on several BRIGHT and BEIR datasets, and attains strong accuracy on BrowseComp-Plus and multi-hop QA without relying on any conventional semantic retriever. Our results indicate that as language agents become stronger, retrieval quality depends not only on reasoning ability but also on the resolution of the interface through which the model interacts with the corpus. DCI thus opens a broader interface-design space for agentic search.
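To make the interface difference concrete, the following is a minimal sketch of the three operations the abstract names (exact lexical matching, sparse clue conjunction, local context checks) run directly over raw files with no index. It is not the authors' implementation: the paper's agent drives real terminal tools, whereas this sketch uses plain Python as a stand-in for `grep -rn` and file reads. The function names, the `.txt` file layout, and the toy corpus are all hypothetical.

```python
import re
import tempfile
from pathlib import Path


def grep_corpus(root, pattern, ignore_case=True):
    """Exact lexical search: scan every .txt file under root, roughly like `grep -rn`."""
    regex = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    hits = []
    for path in sorted(Path(root).rglob("*.txt")):
        with path.open(encoding="utf-8", errors="replace") as fh:
            for line_no, line in enumerate(fh, start=1):
                if regex.search(line):
                    hits.append((path, line_no, line.rstrip("\n")))
    return hits


def files_matching_all(root, terms):
    """Sparse clue conjunction: files that contain every term (case-insensitive substring)."""
    matched = []
    for path in sorted(Path(root).rglob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace").lower()
        if all(term.lower() in text for term in terms):
            matched.append(path)
    return matched


def read_context(path, line_no, radius=1):
    """Local context check: return the lines surrounding a 1-indexed hit."""
    lines = Path(path).read_text(encoding="utf-8", errors="replace").splitlines()
    start = max(0, line_no - 1 - radius)
    return lines[start:line_no + radius]


def build_toy_corpus(root):
    """Write a tiny hypothetical corpus used only for illustration."""
    docs = {
        "a.txt": "Ada Lovelace wrote notes on the Analytical Engine.\n"
                 "The notes were published in 1843.\n",
        "b.txt": "Charles Babbage designed the Analytical Engine.\n"
                 "He never completed it.\n",
        "c.txt": "Grace Hopper worked on the Harvard Mark I.\n",
    }
    for name, body in docs.items():
        (Path(root) / name).write_text(body, encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as corpus:
        build_toy_corpus(corpus)
        # Step 1: a broad lexical probe surfaces candidate files.
        for path, line_no, line in grep_corpus(corpus, r"analytical engine"):
            print(f"{path.name}:{line_no}: {line}")
        # Step 2: combine two weak clues to narrow the candidates.
        for path in files_matching_all(corpus, ["analytical engine", "1843"]):
            # Step 3: read around the hit to verify the evidence in context.
            print(path.name, read_context(path, 2, radius=1))
```

Every step here reads the raw files at query time, so nothing is filtered out by an earlier top-k cut, and a changed file is visible on the next call without re-indexing. The trade-off is a full scan per query, which this sketch makes no attempt to optimize.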
Problem

Research questions and friction points this paper is trying to address.

agentic search
retrieval bottleneck
corpus interaction
semantic similarity
multi-step reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct Corpus Interaction
Agentic Search
Retrieval Interface
Terminal-based Retrieval
Multi-step Reasoning
Zhuofeng Li
Texas A&M University
Haoxiang Zhang
Queen’s University
Software Engineering, Empirical Software Engineering, Mining Software Repositories
Cong Wei
University of Waterloo
Reasoning, Diffusion, Efficiency
Pan Lu
Stanford University
Machine Learning, Natural Language Processing, Machine Reasoning, Mathematical Reasoning
Ping Nie
Waterloo University
Natural Language Processing, Information Retrieval, Recommendation Systems, Time Series Forecasting
Yi Lu
University of Waterloo
Yuyang Bai
Texas A&M University
Natural Language Processing, Large Language Models
Shangbin Feng
University of Washington
natural language processing, social network analysis, knowledge bases
Hangxiao Zhu
Texas A&M University
Ming Zhong
University of Illinois Urbana-Champaign
Natural Language Processing
Yuyu Zhang
Research Scientist, ByteDance
Machine Learning
Jianwen Xie
Research Scientist
Generative Models, AI for Science, Computer Vision
Yejin Choi
Stanford University / NVIDIA
Natural Language Processing, Deep Learning, Artificial Intelligence, Commonsense Reasoning
James Zou
Stanford University
Machine learning, computational biology, computational health, statistics, biotech
Jiawei Han
Abel Bliss Professor of Computer Science, University of Illinois
data mining, database systems, data warehousing, information networks
Wenhu Chen
Assistant Professor at University of Waterloo
Natural Language Processing, Artificial Intelligence, Deep Learning
Jimmy Lin
University of Waterloo
information retrieval, natural language processing, data management, big data
Dongfu Jiang
University of Waterloo
Large Language Model, Multimodality Reasoning, Evaluation
Yu Zhang
Texas A&M University
Data Mining, Natural Language Processing, AI4Science