Unified Data Discovery across Query Modalities and User Intents

📅 2026-04-29
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing data discovery methods are typically confined to a single query modality (natural language or tabular queries) or tailored to a specific user intent, which limits their generalizability. This work proposes UniDisc, a unified framework that handles both natural language and tabular queries without intent-specific design. UniDisc builds a heterogeneous graph from the data lake and uses heterogeneous graph neural networks, dual-view neighbor aggregation, and a joint optimization strategy to learn robust cross-modal embeddings under limited supervision. On seven benchmark datasets, UniDisc consistently outperforms strong baselines on data discovery for both query modalities.
📝 Abstract
Data discovery - retrieving relevant tables from a data lake in response to user queries - is a fundamental building block for downstream analytics. In practice, data discovery must support different query modalities, including natural language (NL) statements and tables, and accommodate diverse user intents, ranging from open-ended enrichment to task-driven inference for applications such as table question answering and fact verification. However, most existing methods are designed for a single query modality or a specific user intent, limiting their generalizability. We propose UniDisc, a unified data discovery framework that supports both NL statements and tables as queries and generalizes across diverse user intents without intent-specific representations or relevance modeling. UniDisc learns a common cross-modal representation model that produces unified representations for queries of different modalities and candidate tables, enabling uniform relevance assessment across discovery scenarios. Since learning such a model typically requires large labeled collections of query-table pairs, which are expensive to obtain, UniDisc instead exploits contextual signals naturally available in data lakes. Specifically, it models NL statements and tables as nodes in a heterogeneous graph with multiple edge types, and applies dual-view neighbor aggregation and joint optimization to learn robust, context-aware representations under limited supervision. These representations support flexible relevance estimation during retrieval. Experiments on seven datasets show that UniDisc consistently outperforms strong baselines on both NL- and table-based discovery.
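The graph-based idea in the abstract can be illustrated with a small, untrained sketch. It is not the UniDisc model. It assumes hand-written 3-dimensional feature vectors, two hypothetical edge types ("describes" and "joinable"), a single round of mean aggregation per edge type in place of a learned heterogeneous GNN, and a query treated as an isolated node. All node names and helper functions are made up for illustration.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean(vectors, dim):
    """Element-wise mean of a list of vectors (zero vector if the list is empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def aggregate(features, edges, dim):
    """One round of neighbor aggregation, done separately per edge type.

    features: {node_id: vector}
    edges: list of (src, dst, edge_type), treated as undirected.
    Each node's new vector averages its own features with one mean
    vector per edge type, so every edge type contributes its own view.
    """
    neighbors = defaultdict(lambda: defaultdict(list))
    for src, dst, etype in edges:
        neighbors[src][etype].append(dst)
        neighbors[dst][etype].append(src)

    out = {}
    for node, vec in features.items():
        views = [vec]
        for nbrs in neighbors[node].values():
            views.append(mean([features[n] for n in nbrs], dim))
        out[node] = normalize(mean(views, dim))
    return out


def rank_tables(query_id, embeddings, table_ids):
    """Rank candidate tables by cosine similarity to the query embedding."""
    q = embeddings[query_id]
    scored = [(t, sum(a * b for a, b in zip(q, embeddings[t]))) for t in table_ids]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data lake: one NL statement attached to tbl_a, plus a query.
features = {
    "q": [1.0, 0.0, 0.0],      # NL query
    "cap_a": [1.0, 0.0, 0.0],  # NL statement found next to tbl_a
    "tbl_a": [0.0, 1.0, 0.0],
    "tbl_b": [0.0, 0.0, 1.0],
    "tbl_c": [0.0, 1.0, 0.0],
}
edges = [
    ("cap_a", "tbl_a", "describes"),
    ("tbl_a", "tbl_c", "joinable"),
]

embeddings = aggregate(features, edges, dim=3)
ranking = rank_tables("q", embeddings, ["tbl_a", "tbl_b", "tbl_c"])
print(ranking)  # tbl_a is ranked first
```

On the raw features, the query has zero similarity to every table. After aggregation, tbl_a absorbs part of the vector of the statement attached to it and becomes retrievable for the NL query, which is the kind of contextual signal the abstract refers to. A real system would learn the aggregation weights rather than use plain means.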
Problem

Research questions and friction points this paper is trying to address.

data discovery
query modalities
user intents
natural language
tables
Innovation

Methods, ideas, or system contributions that make the work stand out.

unified data discovery
cross-modal representation
heterogeneous graph
neighbor aggregation
query-table retrieval