Trove: A Flexible Toolkit for Dense Retrieval

📅 2025-11-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing dense retrieval toolkits fall short on usability, flexibility, and scalability. Trying a new dataset configuration typically means computing and storing another copy of the dataset; components are hard to customize; evaluation and hard negative mining lack a unified low-code framework; and multi-node distributed support is limited. This paper introduces Trove, an open-source dense retrieval toolkit that, according to the authors, is the first to load and process retrieval datasets on the fly (filtering, selection, transformation, and combination). Its modular architecture and dynamic data management express complex dataset configurations in a few lines of code while reducing memory consumption by a factor of 2.6. A low-code pipeline for evaluation and hard negative mining accepts user-defined components in place of built-in ones and runs on multiple nodes without code changes, with inference time falling roughly in proportion to the number of nodes. Together, these features speed up exploratory research and experimental iteration.

📝 Abstract

We introduce Trove, an easy-to-use open-source retrieval toolkit that simplifies research experiments without sacrificing flexibility or speed. For the first time, we introduce efficient data management features that load and process (filter, select, transform, and combine) retrieval datasets on the fly, with just a few lines of code. This gives users the flexibility to easily experiment with different dataset configurations without the need to compute and store multiple copies of large datasets. Trove is highly customizable: in addition to many built-in options, it allows users to freely modify existing components or replace them entirely with user-defined objects. It also provides a low-code and unified pipeline for evaluation and hard negative mining, which supports multi-node execution without any code changes. Trove's data management features reduce memory consumption by a factor of 2.6. Moreover, Trove's easy-to-use inference pipeline incurs no overhead, and inference times decrease linearly with the number of available nodes. Most importantly, we demonstrate how Trove simplifies retrieval experiments and allows for arbitrary customizations, thus facilitating exploratory research.
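To make the idea of on-the-fly processing concrete, here is a minimal Python sketch of lazy filter, transform, and combine operations over retrieval records. It is not Trove's API: `Triple`, `LazyView`, and every method name are hypothetical names chosen for illustration. The sketch only shows why a new dataset configuration does not require a new stored copy, since each view is a recipe that is evaluated when iterated.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Triple:
    """One (query, document, relevance score) record. Hypothetical schema."""
    query_id: str
    doc_id: str
    score: float


class LazyView:
    """Hypothetical lazy view over records; nothing is copied until iteration."""

    def __init__(self, source: Callable[[], Iterable[Triple]]):
        self._source = source  # callable that yields records on demand

    def __iter__(self) -> Iterator[Triple]:
        return iter(self._source())

    def filter(self, pred: Callable[[Triple], bool]) -> "LazyView":
        return LazyView(lambda: (t for t in self if pred(t)))

    def transform(self, fn: Callable[[Triple], Triple]) -> "LazyView":
        return LazyView(lambda: (fn(t) for t in self))

    def combine(self, other: "LazyView") -> "LazyView":
        def gen() -> Iterator[Triple]:
            yield from self
            yield from other
        return LazyView(gen)


# Two toy sources standing in for large on-disk datasets.
annotated = [Triple("q1", "d1", 3.0), Triple("q1", "d2", 0.0), Triple("q2", "d3", 1.0)]
synthetic = [Triple("q3", "d4", 2.0)]

# Keep positives from one source, merge in another, and binarize the scores.
positives = LazyView(lambda: annotated).filter(lambda t: t.score >= 1.0)
mixed = positives.combine(LazyView(lambda: synthetic)).transform(
    lambda t: Triple(t.query_id, t.doc_id, min(t.score, 1.0))
)

result = list(mixed)  # records are produced only at this point
```

Changing the filter threshold or the set of combined sources creates a new view object but no new copy of the underlying data, which is the property the abstract attributes to Trove's data management.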

Problem

Research questions and friction points this paper is trying to address.

Simplifying dense retrieval experiments with a flexible toolkit

Enabling efficient data management without storing multiple dataset copies

Providing customizable components and a unified pipeline for evaluation and hard negative mining

Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficient on-the-fly data management reduces memory consumption by a factor of 2.6

Highly customizable toolkit with replaceable components

Low-code unified pipeline supports multi-node execution
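The multi-node scaling claim can be illustrated with a generic sharding sketch in Python. This is an assumption about how such pipelines commonly divide work, not a description of Trove's implementation; `shard_bounds` and its parameters are hypothetical. Each node encodes a contiguous slice of the corpus, so per-node work is about the corpus size divided by the number of nodes.

```python
from typing import List, Tuple


def shard_bounds(num_items: int, num_nodes: int, rank: int) -> Tuple[int, int]:
    """Return the [start, end) slice of items assigned to node `rank`.

    Items are split as evenly as possible; the first `num_items % num_nodes`
    nodes each take one extra item.
    """
    if not 0 <= rank < num_nodes:
        raise ValueError("rank must be in [0, num_nodes)")
    base, extra = divmod(num_items, num_nodes)
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


# Example: 10 documents spread over 4 nodes.
bounds: List[Tuple[int, int]] = [shard_bounds(10, 4, r) for r in range(4)]
largest_shard = max(end - start for start, end in bounds)
```

Because the largest shard holds at most one item more than the smallest, the slowest node does close to 1/N of the encoding work, which is consistent with inference time shrinking as nodes are added, provided communication and result merging stay cheap.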

🔎 Similar Papers

A Comprehensive Survey on Retrieval Methods in Recommender Systems