🤖 AI Summary
For privacy-sensitive, access-restricted data (e.g., healthcare and government datasets), this paper addresses the challenge of table union search (TUS): determining, without access to the raw data, whether two tables can be meaningfully concatenated row-wise. We propose the first fully metadata-based unionability detection method, departing from conventional approaches that rely on the original table contents. The method is a semantics-driven metadata matching framework: it represents metadata with semantic embeddings, models column-level semantic similarity, and integrates structural and contextual features via a lightweight graph neural network. Evaluated on real-world restricted-data scenarios, our approach achieves 81% accuracy in unionability judgment and outperforms existing baselines in precision and recall. This metadata-only inference technique enables verifiable, regulation-compliant data integration and discovery under the FAIR principles while preserving data privacy.
📝 Abstract
Over the past decade, the Table Union Search (TUS) task has aimed to identify unionable tables within data lakes to improve data integration and discovery. While numerous solutions have been introduced, they primarily rely on open data, which makes them inapplicable to restricted-access data, such as medical records or government statistics, where privacy concerns prevent access to the table contents. Restricted data can still be shared through metadata, which ensures confidentiality while supporting data reuse. This paper explores how TUS can be computed on restricted-access data using metadata alone. We propose a method that achieves 81% accuracy in unionability detection and outperforms existing benchmarks in precision and recall. Our results highlight the potential of metadata-driven approaches for integrating restricted data and for facilitating secure data discovery in privacy-sensitive domains. This aligns with the FAIR principles by ensuring data is Findable, Accessible, Interoperable, and Reusable while preserving confidentiality.
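To make the idea of computing unionability from metadata alone more concrete, the following is a minimal sketch and not the method proposed in the paper. It assumes each column is described by a hypothetical metadata record with `name`, `type`, and `description` fields, and it substitutes plain token overlap (Jaccard similarity) for learned semantic embeddings; the greedy one-to-one column matching and the normalisation by the larger column count are likewise illustrative choices.

```python
import re


def tokens(text):
    """Lowercase a metadata string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def column_similarity(col_a, col_b):
    """Jaccard overlap of name + description tokens; 0.0 when declared types differ.

    The 'name', 'type', and 'description' keys are assumed for illustration only.
    """
    if col_a["type"] != col_b["type"]:
        return 0.0
    tok_a = tokens(col_a["name"] + " " + col_a.get("description", ""))
    tok_b = tokens(col_b["name"] + " " + col_b.get("description", ""))
    if not tok_a or not tok_b:
        return 0.0
    return len(tok_a & tok_b) / len(tok_a | tok_b)


def unionability_score(table_a, table_b):
    """Greedily match columns one-to-one and average the matched similarities.

    The sum is divided by the larger column count, so unmatched columns
    lower the score. No cell values are read at any point.
    """
    if not table_a or not table_b:
        return 0.0
    pairs = sorted(
        (
            (column_similarity(a, b), i, j)
            for i, a in enumerate(table_a)
            for j, b in enumerate(table_b)
        ),
        reverse=True,
    )
    used_a, used_b = set(), set()
    total = 0.0
    for sim, i, j in pairs:
        if sim == 0.0:
            break  # remaining pairs share no tokens or differ in type
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        total += sim
    return total / max(len(table_a), len(table_b))


# Hypothetical metadata records for two yearly extracts and one unrelated table.
admissions_2021 = [
    {"name": "patient_id", "type": "string", "description": "Pseudonymised patient identifier"},
    {"name": "admission_date", "type": "date", "description": "Date of hospital admission"},
]
admissions_2022 = [
    {"name": "patient_id", "type": "string", "description": "Pseudonymised patient identifier"},
    {"name": "admission_date", "type": "date", "description": "Date of hospital admission"},
]
rainfall = [
    {"name": "rainfall_mm", "type": "float", "description": "Monthly rainfall in millimetres"},
]

print(unionability_score(admissions_2021, admissions_2022))
print(unionability_score(admissions_2021, rainfall))
```

In a sketch like this, a threshold on the score would turn it into a unionable / not-unionable decision; where to set that threshold, and whether token overlap is a reasonable stand-in for semantic similarity, would have to be validated on labelled table pairs.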