SuiteEval: Simplifying Retrieval Benchmarks

📅 2026-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the fragmentation in information retrieval evaluation caused by disparate data subsets, aggregation methods, and pipeline configurations, which severely hinders reproducibility and cross-dataset model comparison. To this end, we propose a unified end-to-end evaluation framework that requires users only to supply a retrieval pipeline generator; the system then automatically handles data loading, index construction, ranking, metric computation, and result aggregation. The framework introduces a dynamic index reuse mechanism to reduce disk overhead and enables one-line extension to new benchmarks such as BEIR, LoTTE, and MS MARCO. This design substantially reduces redundant evaluation code while improving standardization, efficiency, and extensibility, thereby advancing reproducible research in information retrieval.

📝 Abstract
Information retrieval evaluation often suffers from fragmented practices -- varying dataset subsets, aggregation methods, and pipeline configurations -- that undermine reproducibility and comparability, especially for foundation embedding models requiring robust out-of-domain performance. We introduce SuiteEval, a unified framework that offers automatic end-to-end evaluation, dynamic indexing that reuses on-disk indices to minimise disk usage, and built-in support for major benchmarks (BEIR, LoTTE, MS MARCO, NanoBEIR, and BRIGHT). Users only need to supply a pipeline generator; SuiteEval handles data loading, indexing, ranking, metric computation, and result aggregation. New benchmark suites can be added in a single line. SuiteEval reduces boilerplate and standardises evaluations to facilitate reproducible IR research at a time when broader benchmark coverage is increasingly required.
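The pipeline-generator interface and one-line suite registration described in the abstract can be sketched roughly as follows. This is a minimal illustrative mock-up, not SuiteEval's actual API: every name (`register_suite`, `evaluate`, the suite and dataset labels, the toy scoring function) is hypothetical, and real index construction and metric computation are stubbed out.

```python
# Hypothetical sketch of the framework pattern: the user supplies only a
# pipeline generator; the framework drives dataset iteration, ranking,
# and score aggregation. All names are illustrative assumptions.
from statistics import mean

SUITES = {}

def register_suite(name, datasets):
    """One-line extension point: map a suite name to its dataset list."""
    SUITES[name] = datasets

# Adding a new benchmark suite is a single line (dataset names made up).
register_suite("BEIR-mini", ["scifact", "nfcorpus"])

def evaluate(pipeline_generator, suite):
    """Build a pipeline per dataset, score it, and aggregate the results."""
    scores = {}
    for dataset in SUITES[suite]:
        pipeline = pipeline_generator(dataset)  # user-supplied factory
        # In a real framework this step would load data, build or reuse an
        # on-disk index, rank, and compute metrics such as nDCG@10.
        scores[dataset] = pipeline(f"query about {dataset}")
    scores["average"] = mean(scores.values())
    return scores

# A trivial stand-in pipeline: "score" is just dataset-name length / 10.
def toy_generator(dataset):
    return lambda query: len(dataset) / 10.0

results = evaluate(toy_generator, "BEIR-mini")
```

The point of the pattern is that per-dataset boilerplate (loading, indexing, metric code) lives once in `evaluate`, while users swap in pipelines via the generator callable.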
Problem

Research questions and friction points this paper is trying to address.

information retrieval evaluation
reproducibility
comparability
foundation embedding models
benchmark fragmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

SuiteEval
retrieval benchmarking
dynamic indexing
foundation embedding models
reproducible evaluation