MTR-Suite: A Framework for Evaluating and Synthesizing Conversational Retrieval Benchmarks

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing conversational retrieval benchmarks rely on either costly, sparse human annotation or rigid, unnatural automated heuristics, which limits how well they assess retrieval-augmented generation (RAG) systems. This work proposes MTR-Suite, a unified framework for auditing, synthesizing, and benchmarking conversational retrieval. It has three components: MTR-Eval, an LLM-based auditor that quantifies alignment gaps in previous benchmarks; MTR-Pipeline, a multi-agent system that uses greedy traversal clustering to generate high-fidelity dialogues at 1/400th of human annotation cost; and MTR-Bench, a general-domain benchmark built around production-style challenges such as hard topic switching and verbosity. The authors report that MTR-Bench offers superior discriminative power compared with previous benchmarks.

📝 Abstract

Accurate evaluation of conversational retrieval is pivotal for advancing Retrieval-Augmented Generation (RAG) systems. However, existing conversational retrieval benchmarks suffer from costly, sparse human annotation or rigid, unnatural automated heuristics. To address these challenges, we introduce MTR-Suite, a unified framework for auditing, synthesizing, and benchmarking retrieval. It features: (1) MTR-Eval, an LLM-based auditor quantifying alignment gaps in previous benchmarks; (2) MTR-Pipeline, a multi-agent system using greedy traversal clustering to generate high-fidelity dialogues at 1/400th human cost; and (3) MTR-Bench, a rigorous general-domain benchmark. MTR-Bench mimics production-style challenges (hard topic switching, verbosity), offering superior discriminative power. We make our code and data publicly available to facilitate future research at https://github.com/rangehow/mtr-suite.
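The abstract names greedy traversal clustering but does not specify it. As a rough illustration of what a greedy, single-pass clustering step can look like, here is a minimal sketch of leader-style clustering over embedding vectors. This is a generic technique swapped in for illustration and may differ from the method in MTR-Pipeline; the function names, the cosine similarity measure, and the `threshold` value are all assumptions.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def greedy_cluster(vectors, threshold=0.8):
    """Hypothetical leader-style greedy clustering (not the paper's algorithm).

    Items are visited once, in order. Each item joins the first cluster whose
    leader (its first member) is at least `threshold` similar; otherwise it
    starts a new cluster and becomes that cluster's leader.
    """
    leaders = []   # index of the representative item for each cluster
    clusters = []  # member indices per cluster, parallel to `leaders`
    for i, vec in enumerate(vectors):
        for c, lead in enumerate(leaders):
            if cosine(vec, vectors[lead]) >= threshold:
                clusters[c].append(i)
                break
        else:
            # No existing leader was similar enough: open a new cluster.
            leaders.append(i)
            clusters.append([i])
    return clusters


# Toy 2-D "embeddings": two items near the x-axis, two near the y-axis.
docs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
clusters = greedy_cluster(docs)
print(clusters)
```

Because the pass is greedy, the result depends on visiting order and on the threshold; a real pipeline would presumably tune both against its own data.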

Problem

Research questions and friction points this paper is trying to address.

conversational retrieval

benchmark

evaluation

human annotation

automated heuristics

Innovation

Methods, ideas, or system contributions that make the work stand out.

conversational retrieval

retrieval-augmented generation

benchmark synthesis

multi-agent dialogue generation

LLM-based evaluation
