Revisiting Task-Oriented Dataset Search in the Era of Large Language Models: Challenges, Benchmark, and Solution

📅 2025-12-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: In data-driven research, retrieving task-appropriate datasets from high-level task descriptions remains difficult because of ambiguous user intent, weak task-dataset alignment, a lack of dedicated benchmarks, and entity ambiguity. Method: The paper proposes KATS, an end-to-end task-oriented dataset search system, and introduces (i) a task-dataset knowledge graph (TDKG) built by a collaborative multi-agent framework; (ii) a semantic framework for task entity linking and dataset entity resolution; and (iii) CS-TDS, a benchmark designed specifically for task-oriented dataset search. KATS integrates multi-agent information extraction, dynamic TDKG construction, vector-based retrieval, and graph-aware re-ranking. Contribution/Results: On CS-TDS, KATS significantly outperforms state-of-the-art retrieval-augmented generation (RAG) baselines in both effectiveness and efficiency. The authors present it as a blueprint for the next generation of dataset discovery systems.

📝 Abstract

The search for suitable datasets is the critical "first step" in data-driven research, but it remains a great challenge. Researchers often need to search for datasets based on high-level task descriptions. However, existing search systems struggle with this task due to ambiguous user intent, task-to-dataset mapping and benchmark gaps, and entity ambiguity. To address these challenges, we introduce KATS, a novel end-to-end system for task-oriented dataset search from unstructured scientific literature. KATS consists of two key components, i.e., offline knowledge base construction and online query processing. The sophisticated offline pipeline automatically constructs a high-quality, dynamically updatable task-dataset knowledge graph by employing a collaborative multi-agent framework for information extraction, thereby filling the task-to-dataset mapping gap. To further address the challenge of entity ambiguity, a unique semantic-based mechanism is used for task entity linking and dataset entity resolution. For online retrieval, KATS utilizes a specialized hybrid query engine that combines vector search with graph-based ranking to generate highly relevant results. Additionally, we introduce CS-TDS, a tailored benchmark suite for evaluating task-oriented dataset search systems, addressing the critical gap in standardized evaluation. Experiments on our benchmark suite show that KATS significantly outperforms state-of-the-art retrieval-augmented generation frameworks in both effectiveness and efficiency, providing a robust blueprint for the next generation of dataset discovery systems.
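The abstract does not specify how dataset entity resolution is implemented, so the following is only a minimal sketch of the general idea: mentions that are similar enough are merged into one entity. Character-trigram Jaccard similarity stands in here for a real semantic embedding model, and the function names (`trigrams`, `similarity`, `resolve`) and the `threshold` value are hypothetical choices, not taken from the paper.

```python
from itertools import combinations


def trigrams(name: str) -> set[str]:
    """Character trigrams of a normalized name (crude stand-in for an embedding)."""
    s = "".join(ch for ch in name.lower() if ch.isalnum())
    return {s[i:i + 3] for i in range(len(s) - 2)} or {s}


def similarity(a: str, b: str) -> float:
    """Jaccard similarity between the trigram sets of two mentions."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def resolve(mentions: list[str], threshold: float = 0.6) -> list[list[str]]:
    """Group mentions whose pairwise similarity reaches the (assumed) threshold."""
    parent = list(range(len(mentions)))  # union-find forest over mention indices

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(mentions)), 2):
        if similarity(mentions[i], mentions[j]) >= threshold:
            parent[find(j)] = find(i)

    groups: dict[int, list[str]] = {}
    for i, mention in enumerate(mentions):
        groups.setdefault(find(i), []).append(mention)
    return list(groups.values())


clusters = resolve(["ImageNet", "Image-Net", "imagenet", "SQuAD"])
print(clusters)
```

In this toy run the three spelling variants of ImageNet collapse into one group and SQuAD stays separate. A real system would likely swap the trigram measure for embedding similarity and add context from the source papers.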

Problem

Research questions and friction points this paper is trying to address.

Addresses ambiguous user intent in dataset search

Fills task-to-dataset mapping and benchmark gaps

Resolves entity ambiguity in task-oriented retrieval

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent framework constructs task-dataset knowledge graph

Semantic mechanism resolves entity ambiguity via task entity linking and dataset entity resolution

Hybrid query engine combines vector search with graph ranking
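The hybrid retrieval idea in the list above can be illustrated with a small two-stage sketch: a vector search selects the task nodes closest to the query, then datasets linked to those tasks are ranked by a graph-derived score. The scoring rule (task similarity times edge weight, summed per dataset), the toy vectors, and the edge weights are all assumptions made for illustration; the source does not describe the actual ranking function used by KATS.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def search(query_vec, task_vecs, task_to_datasets, top_tasks=2):
    """Return (dataset, score) pairs, best first."""
    # Stage 1: vector search over task nodes.
    nearest = sorted(
        ((cosine(query_vec, vec), task) for task, vec in task_vecs.items()),
        reverse=True,
    )[:top_tasks]

    # Stage 2: graph-based ranking over task -> dataset edges.
    scores: dict[str, float] = {}
    for sim, task in nearest:
        for dataset, weight in task_to_datasets.get(task, {}).items():
            scores[dataset] = scores.get(dataset, 0.0) + sim * weight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data (hypothetical): 2-d task embeddings and edge weights that could,
# for example, count how many papers use a dataset for a task.
task_vecs = {
    "question answering": [1.0, 0.0],
    "reading comprehension": [0.8, 0.6],
    "image classification": [0.0, 1.0],
}
task_to_datasets = {
    "question answering": {"SQuAD": 3, "Natural Questions": 2},
    "reading comprehension": {"SQuAD": 2, "RACE": 1},
    "image classification": {"ImageNet": 5},
}

results = search([1.0, 0.1], task_vecs, task_to_datasets)
ranked = [dataset for dataset, _ in results]
print(ranked)
```

With these toy inputs, SQuAD ranks first because it is linked to both of the nearest tasks, while ImageNet is never scored because its task falls outside the top two.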
