AI Summary
Existing conversational retrieval evaluation benchmarks rely either on costly human annotations or unnatural heuristic methods, limiting their effectiveness in assessing retrieval-augmented generation systems. This work proposes MTR-Suite, a unified framework that introduces an integrated paradigm combining auditing, synthesis, and evaluation. It leverages a multi-agent high-fidelity dialogue synthesis pipeline and a large language model-based alignment analyzer, augmented with a greedy traversal clustering algorithm to generate high-quality conversational data. The framework yields MTR-Bench, a general-domain evaluation benchmark specifically designed to address real-world challenges such as topic shifts and verbose distractions. Requiring only 1/400th of the human annotation cost of conventional approaches, MTR-Suite produces production-grade evaluation data with strong discriminative power, significantly outperforming existing benchmarks across multiple dimensions.
Abstract
Accurate evaluation of conversational retrieval is pivotal for advancing Retrieval-Augmented Generation (RAG) systems. However, existing conversational retrieval benchmarks suffer from costly, sparse human annotation or rigid, unnatural automated heuristics. To address these challenges, we introduce MTR-Suite, a unified framework for auditing, synthesizing, and benchmarking retrieval. It features: (1) MTR-Eval, an LLM-based auditor quantifying alignment gaps in previous benchmarks; (2) MTR-Pipeline, a multi-agent system using greedy traversal clustering to generate high-fidelity dialogues at 1/400th of the human annotation cost; and (3) MTR-Bench, a rigorous general-domain benchmark. MTR-Bench mimics production-style challenges (hard topic switching, verbosity), offering superior discriminative power. To facilitate future research, we make our code and data publicly available at https://github.com/rangehow/mtr-suite.