Efficient and Effective Table-Centric Table Union Search in Data Lakes

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that information on a given topic in data lakes is often scattered across multiple tables, while existing table union search methods rely primarily on column-level features and neglect table-level semantics, which limits both ranking quality and efficiency. To overcome this, the authors propose TACTUS, a table-centric search method. In the offline phase, TACTUS learns table embeddings by constructing positive and negative table pairs and applying attention-based encoding to model table-level unionability. During online query processing, it first retrieves a compact candidate set of tables using table-level scores and then reranks the candidates by combining table-level and column-level evidence. Experiments on real-world datasets show that TACTUS significantly improves result quality while being much faster than existing methods in both offline preprocessing and online search, often by an order of magnitude.
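The summary does not say how positive table pairs are built. As a rough illustration only, one common self-supervised choice is to split a single table's rows into two disjoint halves that share a schema, so the halves are unionable by construction. The function name `make_positive_pair` and the row-split strategy are assumptions for this sketch, not the construction used by TACTUS.

```python
import random


def make_positive_pair(rows, seed=0):
    """Split a table's rows into two disjoint halves with the same schema.

    Assumption: row splitting stands in for the paper's positive pair
    construction, which is not specified in this summary.
    """
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    half = len(indices) // 2
    # Sort the sampled indices so each half keeps the original row order.
    left = [rows[i] for i in sorted(indices[:half])]
    right = [rows[i] for i in sorted(indices[half:])]
    return left, right
```

Each half can then be encoded separately, with the pair treated as a positive example and tables drawn from elsewhere in the lake treated as candidate negatives.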

📝 Abstract
In data lakes, information on the same subject is often fragmented across multiple tables. Table union search aims to find the top-k tables that can be unioned with a query table to extend it with more rows, without relying on metadata or ground-truth labels. Existing methods are mainly column-centric: they focus on modeling column unionability scores using column embeddings, which are then used throughout the search process for column matching, filtering, and aggregation. However, this overlooks holistic table-level semantics, which may result in suboptimal rankings and inefficiencies. We introduce TACTUS, a novel table-centric method for table union search. Unlike prior work that searches from columns to tables, we search in a table-first way and examine columns only in the final step. During offline processing, we directly generate table embeddings for holistic, table-level unionability scoring by designing table-level representation techniques, including positive table pair construction to simulate unionable tables, two-pronged negative table sampling to avoid latent positives and mine hard negatives to enhance representation quality, and attentive table encoding for effective embeddings. During online search, we first develop a table-centric adaptive candidate retrieval method that efficiently selects a compact, high-quality candidate pool by leveraging the distribution of table-level unionability scores induced by table embeddings. We then inspect columns only within this compact candidate set and design a dual-evidence reranking technique that integrates table-level and column-level scores to refine the final top-k results. Extensive experiments on real-world datasets show that TACTUS significantly improves result quality while being much faster than existing methods in both offline and online processing, often by an order of magnitude.
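The table-first online search described in the abstract can be sketched roughly as follows, assuming table and column embeddings have already been computed. Everything specific here is an assumption made for illustration: cosine similarity as the unionability score, a mean-plus-`z`-standard-deviations cutoff as the "adaptive" pool rule, greedy best-match column scoring, and a linear blend weighted by `alpha`. The abstract does not give the paper's actual formulas.

```python
import numpy as np


def cosine_scores(query_vec, table_vecs):
    """Cosine similarity between one query vector and each row of table_vecs."""
    q = query_vec / np.linalg.norm(query_vec)
    t = table_vecs / np.linalg.norm(table_vecs, axis=1, keepdims=True)
    return t @ q


def retrieve_candidates(table_scores, k, z=1.0):
    """Keep tables scoring above mean + z * std, but never fewer than k.

    Assumption: this threshold is a stand-in for the paper's adaptive
    candidate retrieval, which uses the score distribution in some way.
    """
    threshold = table_scores.mean() + z * table_scores.std()
    order = np.argsort(-table_scores)  # best table first
    n_keep = max(k, int((table_scores >= threshold).sum()))
    return order[:n_keep]


def column_score(query_cols, cand_cols):
    """Average, over query columns, of the best cosine match in the candidate.

    This is a greedy match, not a one-to-one column alignment.
    """
    q = query_cols / np.linalg.norm(query_cols, axis=1, keepdims=True)
    c = cand_cols / np.linalg.norm(cand_cols, axis=1, keepdims=True)
    sims = q @ c.T
    return float(sims.max(axis=1).mean())


def search(query_vec, query_cols, lake_table_vecs, lake_cols, k, alpha=0.5):
    """Two-stage search: table-level retrieval, then dual-evidence reranking."""
    table_scores = cosine_scores(query_vec, lake_table_vecs)
    candidates = retrieve_candidates(table_scores, k)
    # Columns are inspected only for tables in the candidate pool.
    fused = {
        int(i): alpha * float(table_scores[i])
        + (1 - alpha) * column_score(query_cols, lake_cols[i])
        for i in candidates
    }
    return sorted(fused, key=fused.get, reverse=True)[:k]
```

The point of the design is that the expensive column comparison runs only on the small candidate pool, while the cheap table-level score prunes the rest of the lake.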
Problem

Research questions and friction points this paper is trying to address.

table union search
data lakes
table-level semantics
column-centric methods
unionability
Innovation

Methods, ideas, or system contributions that make the work stand out.

table-centric
table union search
table embedding
negative sampling
dual-evidence reranking