🤖 AI Summary
This work addresses the scarcity of verifiable, document-grounded reasoning traces in existing datasets, which hinders the development of trustworthy agentic large language models (LLMs). To bridge this gap, we introduce AgentSim, an open-source platform that simulates retrieval-augmented generation (RAG) agents over arbitrary document collections to produce evidence-based, step-by-step reasoning trajectories. The framework incorporates Corpus-Aware Seeding to increase trace diversity and an Active Validation mechanism that combines multi-model consistency checks with human-in-the-loop annotation, directing human effort to the steps where models disagree. We present the Agent-Trace Corpus (ATC), comprising over 103,000 verifiable reasoning steps across three information retrieval (IR) benchmarks, with a 100% grounding rate on substantive answers. Furthermore, our analysis reveals systematic differences in how state-of-the-art models approach information seeking.
📝 Abstract
Training trustworthy agentic LLMs requires data that shows the grounded reasoning process, not just the final answer. Existing datasets fall short: question-answering data is outcome-only, chain-of-thought data is not tied to specific documents, and web-agent datasets track interface actions rather than the core retrieval and synthesis steps of a RAG workflow. We introduce AgentSim, an open-source platform for simulating RAG agents. It generates verifiable, stepwise traces of agent reasoning over any document collection. AgentSim uses a policy to ensure the agent widely explores the document set. It combines a multi-model validation pipeline with an active human-in-the-loop process. This approach focuses human effort on difficult steps where models disagree. Using AgentSim, we construct and release the Agent-Trace Corpus (ATC), a large collection of grounded reasoning trajectories spanning three established IR benchmarks. We make three contributions: (1) the AgentSim platform with two mechanisms, Corpus-Aware Seeding and Active Validation, that improve trace diversity and quality; (2) the Agent-Trace Corpus (ATC), over 103,000 verifiable reasoning steps spanning three IR benchmarks, with 100% grounding rate on substantive answers; and (3) a comparative behavioral analysis revealing systematic differences in how state-of-the-art models approach information seeking. Platform, toolkit, and corpus are publicly available.
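The idea of focusing human effort on steps where models disagree can be illustrated with a minimal sketch. This is not AgentSim's implementation: the `StepVerdicts` record, the `route_step` function, the label strings, and the `min_agreement` threshold are all hypothetical names chosen for illustration, and the abstract does not specify how disagreement is actually measured.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class StepVerdicts:
    """Hypothetical record: one reasoning step and the label each validator model assigned."""
    step_id: str
    labels: list[str]  # e.g. ["grounded", "grounded", "ungrounded"]


def route_step(step: StepVerdicts, min_agreement: float = 1.0) -> str:
    """Return 'auto' if enough validators share the majority label, else 'human_review'.

    min_agreement is an assumed threshold: the fraction of validators that
    must agree before the majority label is accepted without a human.
    """
    if not step.labels:
        # No model verdicts at all: nothing to agree on, so ask a human.
        return "human_review"
    _, top_count = Counter(step.labels).most_common(1)[0]
    agreement = top_count / len(step.labels)
    return "auto" if agreement >= min_agreement else "human_review"


steps = [
    StepVerdicts("s1", ["grounded", "grounded", "grounded"]),
    StepVerdicts("s2", ["grounded", "ungrounded", "grounded"]),
]

# Only steps with validator disagreement reach the human annotation queue.
human_queue = [s.step_id for s in steps if route_step(s) == "human_review"]
```

With the default threshold of full agreement, only `s2` is queued for annotation; lowering `min_agreement` trades annotation cost against the risk of accepting a contested label.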