🤖 AI Summary
This paper introduces Explainable Dataset Search with Examples (Explainable DSE), a generalized retrieval task that combines a keyword query with example target datasets and additionally requires identifying the metadata and content fields of each retrieved dataset that indicate its relevance to the query and its similarity to the target datasets, thereby making search results interpretable. To support this task, the authors construct DSEBench, the first test collection with high-quality annotations at both the dataset and field levels, and, to address data scarcity, employ a large language model to synthesize large-scale training annotations. Extensive baselines are established on DSEBench by adapting sparse, dense, and LLM-based retrieval, reranking, and explanation methods, providing a standardized starting point for research on explainable dataset search.
📝 Abstract
Dataset search is an established information retrieval task. Current paradigms either retrieve datasets that are relevant to a keyword query or find datasets that are similar to an input target dataset. To allow information needs to be specified using both paradigms in combination, in this article we investigate the more generalized task of Dataset Search with Examples (DSE) and further extend it to Explainable DSE, which requires identifying the metadata and content fields of a dataset that indicate its relevance to the query and its similarity to the target datasets. To facilitate this research, we construct DSEBench, a test collection that provides high-quality dataset- and field-level annotations to enable the evaluation of explainable DSE. We also employ a large language model to generate numerous annotations to be used for training. We establish extensive baselines on DSEBench by adapting and evaluating a variety of sparse, dense, and LLM-based retrieval, reranking, and explanation methods.
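To make the task setup concrete, the following is a minimal sketch of the Explainable DSE input/output contract: given a keyword query and an example target dataset, rank candidate datasets and name the field that best supports each ranking. All dataset records, field names, and the token-overlap scoring here are illustrative assumptions, not the paper's actual methods (which include sparse, dense, and LLM-based retrieval).

```python
# Illustrative sketch of the Explainable DSE task, not the paper's method.
# Relevance is approximated by token overlap between the query / target
# example and each field of a candidate dataset; the "explanation" is the
# field contributing the highest score. All data below is made up.

def tokens(text):
    return set(text.lower().split())

def field_score(field_text, query, example_text):
    """Overlap with the keyword query plus overlap with the target example."""
    t = tokens(field_text)
    return len(t & tokens(query)) + len(t & tokens(example_text))

def explainable_dse(query, example, candidates):
    """Rank candidate datasets; attach the field that justifies each score."""
    example_text = " ".join(example.values())
    results = []
    for name, fields in candidates.items():
        per_field = {f: field_score(v, query, example_text)
                     for f, v in fields.items()}
        best_field = max(per_field, key=per_field.get)
        results.append((name, sum(per_field.values()), best_field))
    return sorted(results, key=lambda r: -r[1])

# Hypothetical candidate corpus with metadata fields.
candidates = {
    "air-quality-2020": {"title": "urban air quality measurements",
                         "description": "hourly pm2.5 readings for major cities"},
    "bird-migration":   {"title": "bird migration tracks",
                         "description": "gps tracks of migratory birds"},
}
# A target dataset given as an example alongside the keyword query.
example = {"title": "city pollution sensors", "description": "pm2.5 sensor data"}

ranking = explainable_dse("air quality pm2.5", example, candidates)
print(ranking[0][0], ranking[0][2])  # top-ranked dataset and its explanatory field
```

In DSEBench, the field-level annotations play the role of `best_field` here: they mark which metadata or content fields substantiate a dataset's relevance, so both the ranking and the explanation can be evaluated against ground truth.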