MTR-Suite: A Framework for Evaluating and Synthesizing Conversational Retrieval Benchmarks

2026-05-20

AI Summary
Existing conversational retrieval evaluation benchmarks rely either on costly human annotations or unnatural heuristic methods, limiting their effectiveness in assessing retrieval-augmented generation systems. This work proposes MTR-Suite, a unified framework that introduces an integrated paradigm combining auditing, synthesis, and evaluation. It leverages a multi-agent high-fidelity dialogue synthesis pipeline and a large language model–based alignment analyzer, augmented with a greedy traversal clustering algorithm to generate high-quality conversational data. The framework yields MTR-Bench, a general-domain evaluation benchmark specifically designed to address real-world challenges such as topic shifts and verbose distractions. Requiring only 1/400th of the human annotation cost of conventional approaches, MTR-Suite produces production-grade evaluation data with strong discriminative power, significantly outperforming existing benchmarks across multiple dimensions.
Abstract
Accurate evaluation of conversational retrieval is pivotal for advancing Retrieval-Augmented Generation (RAG) systems. However, existing conversational retrieval benchmarks suffer from costly, sparse human annotation or rigid, unnatural automated heuristics. To address these challenges, we introduce MTR-Suite, a unified framework for auditing, synthesizing, and benchmarking retrieval. It features: (1) MTR-Eval, an LLM-based auditor quantifying alignment gaps in previous benchmarks; (2) MTR-Pipeline, a multi-agent system using greedy traversal clustering to generate high-fidelity dialogues at 1/400th human cost; and (3) MTR-Bench, a rigorous general-domain benchmark. MTR-Bench mimics production-style challenges (hard topic switching, verbosity), offering superior discriminative power. We make our code and data publicly available to facilitate future research at https://github.com/rangehow/mtr-suite.
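The abstract names greedy traversal clustering but does not describe the algorithm. As a rough illustration only, the sketch below shows one generic reading of the term: walk the items once, in order, and attach each item to the first existing cluster whose seed is similar enough, otherwise start a new cluster. The function names, the cosine-similarity criterion, the 0.8 threshold, and the toy vectors are all assumptions made for this example, not details taken from MTR-Pipeline.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def greedy_cluster(vectors, threshold=0.8):
    """Single-pass greedy clustering (hypothetical sketch, not the paper's method).

    Each cluster is a list of indices into `vectors`; the first index is the
    cluster's seed. An item joins the first cluster whose seed it matches at
    or above `threshold`, otherwise it seeds a new cluster.
    """
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing seed was similar enough, so this item starts a cluster.
            clusters.append([i])
    return clusters


# Toy 2-D "embeddings": two items near the x-axis, two near the y-axis.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(greedy_cluster(docs))
```

A single pass like this is cheap and deterministic, but its output depends on item order and on the chosen threshold; the actual pipeline may handle both differently.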
Problem

Research questions and friction points this paper is trying to address.

conversational retrieval
benchmark
evaluation
human annotation
automated heuristics
Innovation

Methods, ideas, or system contributions that make the work stand out.

conversational retrieval
retrieval-augmented generation
benchmark synthesis
multi-agent dialogue generation
LLM-based evaluation