🤖 AI Summary
This study addresses the lack of empirical analysis of the search behaviors of large language model (LLM)-driven agents in real-world multi-turn retrieval, particularly regarding conversational intent evolution and evidence utilization mechanisms. Leveraging 14.44 million real search requests, the authors sessionize the logs and employ LLM-assisted annotation to label intents and query reformulations, proposing the Context-driven Term Adoption Rate (CTAR) to quantify evidence traceability. The work reveals key dynamic characteristics of authentic agent-driven search: over 90% of multi-turn sessions span no more than ten steps, 89% of consecutive steps occur within one minute, and 54% of new query terms can be traced back to accumulated evidence. Furthermore, it demonstrates that intent type significantly influences patterns of exploration versus repetition, offering critical empirical insights for optimizing agent-based search systems.
📝 Abstract
LLM-powered search agents are increasingly being used for multi-step information-seeking tasks, yet the IR community lacks an empirical understanding of how agentic search sessions unfold and how retrieved evidence is used. This paper presents a large-scale log analysis of agentic search based on 14.44M search requests (3.97M sessions) collected from DeepResearchGym, an open-source search API accessed by external agentic clients. We sessionize the logs, assign session-level intents and step-wise query-reformulation labels using LLM-based annotation, and propose the Context-driven Term Adoption Rate (CTAR) to quantify whether newly introduced query terms are traceable to previously retrieved evidence. Our analyses reveal distinctive behavioral patterns. First, over 90% of multi-turn sessions contain at most ten steps, and 89% of inter-step intervals fall under one minute. Second, behavior varies by intent: fact-seeking sessions exhibit high repetition that increases over time, while sessions requiring reasoning sustain broader exploration. Third, agents reuse evidence across steps. On average, 54% of newly introduced query terms appear in the accumulated evidence context, with contributions from earlier steps beyond the most recent retrieval. The findings suggest that agentic search may benefit from repetition-aware early stopping, intent-adaptive retrieval budgets, and explicit cross-step context tracking. We plan to release the anonymized logs to support future research.
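To make the CTAR idea concrete, here is a minimal sketch of how such a metric could be computed over a session. The paper's exact tokenization, term-matching rules, and function names are not specified in the abstract, so everything below (the `ctar` helper, whitespace tokenization, lowercase matching) is an illustrative assumption, not the authors' implementation.

```python
def ctar(steps):
    """Sketch of a Context-driven Term Adoption Rate computation.

    steps: list of (query, evidence_text) pairs in session order.
    A "new" query term is a token not seen in any earlier query; it counts
    as "adopted" if it appears in the evidence accumulated from all earlier
    steps (not just the most recent retrieval). Returns the adopted fraction.
    """
    seen_query_terms = set()   # terms used in any prior query
    evidence_terms = set()     # terms from all prior retrieved evidence
    new_terms = adopted = 0
    for query, evidence in steps:
        q_terms = set(query.lower().split())
        if seen_query_terms:   # the opening query has no context to trace
            fresh = q_terms - seen_query_terms
            new_terms += len(fresh)
            adopted += len(fresh & evidence_terms)
        seen_query_terms |= q_terms
        evidence_terms |= set(evidence.lower().split())
    return adopted / new_terms if new_terms else 0.0
```

For example, if a second-step query introduces three terms and all three appeared in the first step's retrieved text, this sketch returns 1.0; if none appeared, it returns 0.0. Aggregating across sessions would yield a session-level average like the 54% reported above.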