$τ$-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge

📅 2026-03-04
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing evaluation benchmarks struggle to assess conversational agents' ability to integrate unstructured knowledge with tool invocation over extended interactions, particularly in high-complexity domains such as financial customer service. To address this gap, this work proposes the τ-Knowledge evaluation framework, which extends τ-Bench with the τ-Banking domain: a simulated environment comprising roughly 700 interconnected unstructured documents. Agents must perform end-to-end tasks involving knowledge retrieval, compliant tool usage, and state validation. The framework is the first to unify unstructured knowledge grounding, tool utilization, and policy compliance within long-horizon dialogue evaluation. Experiments show that state-of-the-art large language models achieve only a 25.5% pass rate on this benchmark, with reliability degrading further under repeated trials, highlighting critical deficiencies in complex knowledge reasoning and robust execution.


๐Ÿ“ Abstract
Conversational agents are increasingly deployed in knowledge-intensive settings, where correct behavior depends on retrieving and applying domain-specific knowledge from large, proprietary, and unstructured corpora during live interactions with users. Yet most existing benchmarks evaluate retrieval or tool use independently of each other, creating a gap in realistic, fully agentic evaluation over unstructured data in long-horizon interactions. We introduce $τ$-Knowledge, an extension of $τ$-Bench for evaluating agents in environments where success depends on coordinating external, natural-language knowledge with tool outputs to produce verifiable, policy-compliant state changes. Our new domain, $τ$-Banking, models realistic fintech customer support workflows in which agents must navigate roughly 700 interconnected knowledge documents while executing tool-mediated account updates. Across embedding-based retrieval and terminal-based search, even frontier models with high reasoning budgets achieve only $\sim$25.5% pass^1, with reliability degrading sharply over repeated trials. Agents struggle to retrieve the correct documents from densely interlinked knowledge bases and to reason accurately over complex internal policies. Overall, $τ$-Knowledge provides a realistic testbed for developing agents that integrate unstructured knowledge in human-facing deployments.
Problem

Research questions and friction points this paper is trying to address.

conversational agents
unstructured knowledge
knowledge retrieval
agent evaluation
long-horizon interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

conversational agents
unstructured knowledge
tool-augmented reasoning
agent evaluation
knowledge retrieval