Rethinking Dataset Discovery with DataScout

📅 2025-07-25

📈 Citations: 0

✨ Influential: 0

career value

195K/year

🤖 AI Summary

In data science, task-oriented dataset discovery faces challenges including implicit user preferences, opaque search spaces, and ambiguous relevance criteria—leading to inefficient query iteration. This paper proposes an AI-augmented dataset search framework that explicitly models user intent via AI-driven query reformulation; enables fine-grained content understanding through joint column- and row-level semantic analysis; and introduces a task-driven relevance metric with dynamically generated, interpretable feedback. The framework supports users in progressively constructing a cognitive model of the search space, thereby enhancing both interpretability and interactive efficiency. Experiments demonstrate that, compared to keyword-based and conventional semantic search methods, our approach improves structured exploration efficiency by 37% and reduces query iterations by 42%. Its effectiveness and generalizability are validated across multiple real-world data science tasks.

Technology Category

Application Category

📝 Abstract

Dataset Search -- the process of finding appropriate datasets for a given task -- remains a critical yet under-explored challenge in data science workflows. Assessing dataset suitability for a task (e.g., training a classification model) is a multi-pronged affair that involves understanding: data characteristics (e.g. granularity, attributes, size), semantics (e.g., data semantics, creation goals), and relevance to the task at hand. Present-day dataset search interfaces are restrictive -- users struggle to convey implicit preferences and lack visibility into the search space and result inclusion criteria -- making query iteration challenging. To bridge these gaps, we introduce DataScout to proactively steer users through the process of dataset discovery via -- (i) AI-assisted query reformulations informed by the underlying search space, (ii) semantic search and filtering based on dataset content, including attributes (columns) and granularity (rows), and (iii) dataset relevance indicators, generated dynamically based on the user-specified task. A within-subjects study with 12 participants comparing DataScout to keyword and semantic dataset search reveals that users uniquely employ DataScout's features not only for structured explorations, but also to glean feedback on their search queries and build conceptual models of the search space.

Problem

Research questions and friction points this paper is trying to address.

Improving dataset search for task suitability assessment

Addressing limitations in current dataset search interfaces

Enhancing user feedback and search space understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

AI-assisted query reformulations for dataset search

Semantic search and filtering by dataset content

Dynamic dataset relevance indicators for tasks

🔎 Similar Papers

No similar papers found.