Unified Data Discovery across Query Modalities and User Intents

📅 2026-04-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing data discovery methods are often confined to a single query modality, such as natural language or tabular queries, or tailored to specific user intents, which limits their generalizability. This work proposes UniDisc, a unified framework that jointly models natural language and tabular queries without intent-specific design. UniDisc builds a heterogeneous graph from the data lake and uses heterogeneous graph neural networks, dual-view neighbor aggregation, and a joint optimization strategy to learn robust cross-modal embeddings under limited supervision. On seven benchmark datasets, UniDisc consistently outperforms strong baselines on data discovery with both query modalities.

📝 Abstract

Data discovery, the retrieval of relevant tables from a data lake in response to user queries, is a fundamental building block for downstream analytics. In practice, data discovery must support different query modalities, including natural language (NL) statements and tables, and accommodate diverse user intents, ranging from open-ended enrichment to task-driven inference for applications such as table question answering and fact verification. However, most existing methods are designed for a single query modality or a specific user intent, limiting their generalizability. We propose UniDisc, a unified data discovery framework that supports both NL statements and tables as queries and generalizes across diverse user intents without intent-specific representations or relevance modeling. UniDisc learns a common cross-modal representation model that produces unified representations for queries of different modalities and candidate tables, enabling uniform relevance assessment across discovery scenarios. Since learning such a model typically requires large labeled collections of query-table pairs, which are expensive to obtain, UniDisc instead exploits contextual signals naturally available in data lakes. Specifically, it models NL statements and tables as nodes in a heterogeneous graph with multiple edge types, and applies dual-view neighbor aggregation and joint optimization to learn robust, context-aware representations under limited supervision. These representations support flexible relevance estimation during retrieval. Experiments on seven datasets show that UniDisc consistently outperforms strong baselines on both NL- and table-based discovery.
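
The general idea in the abstract (NL statements and tables as nodes of one graph with several edge types, context-aware embeddings, one relevance function for both query modalities) can be illustrated with a toy sketch. This is not the UniDisc model: the aggregation below is a plain per-edge-type mean standing in for the paper's dual-view neighbor aggregation, nothing is trained, and the node ids, edge-type names ("nl-table", "table-table"), initial vectors, and helper functions are all hypothetical.

```python
import math
from collections import defaultdict


def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def aggregate(node_emb, edges, self_weight=0.5):
    """One round of neighbor aggregation on a heterogeneous graph.

    Neighbors are averaged per edge type first, then across edge types,
    and the result is mixed with the node's own vector. This is a
    simplified stand-in, not the paper's dual-view aggregation.
    """
    neighbors = defaultdict(lambda: defaultdict(list))
    for edge_type, pairs in edges.items():
        for u, v in pairs:  # edges are treated as undirected
            neighbors[u][edge_type].append(v)
            neighbors[v][edge_type].append(u)

    updated = {}
    for node, vec in node_emb.items():
        per_type = [mean([node_emb[n] for n in ns])
                    for ns in neighbors[node].values()]
        if not per_type:  # isolated node keeps its own vector
            updated[node] = list(vec)
            continue
        context = mean(per_type)
        updated[node] = [self_weight * a + (1 - self_weight) * b
                         for a, b in zip(vec, context)]
    return updated


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def rank_tables(query_node, node_emb):
    """Rank all table nodes (except the query itself) by cosine similarity."""
    q = node_emb[query_node]
    scored = [(n, cosine(q, v)) for n, v in node_emb.items()
              if n[0] == "table" and n != query_node]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy data lake: one NL statement and three tables.
initial = {
    ("nl", "s1"): [1.0, 0.0],
    ("table", "t1"): [0.8, 0.2],
    ("table", "t2"): [0.0, 1.0],
    ("table", "t3"): [0.1, 0.9],
}
edges = {
    "nl-table": [(("nl", "s1"), ("table", "t1"))],
    "table-table": [(("table", "t2"), ("table", "t3"))],
}

embeddings = aggregate(initial, edges)
nl_results = rank_tables(("nl", "s1"), embeddings)        # NL statement as query
table_results = rank_tables(("table", "t2"), embeddings)  # table as query
```

Because both query types live in the same embedding space, the same `rank_tables` call serves an NL query and a table query; only the query node changes.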

Problem

Research questions and friction points this paper is trying to address.

data discovery

query modalities

user intents

natural language

tables

Innovation

Methods, ideas, or system contributions that make the work stand out.

unified data discovery

cross-modal representation

heterogeneous graph