🤖 AI Summary
This work addresses the challenge that information on a given topic in data lakes is often scattered across multiple tables, while existing table union search methods are mainly column-centric and overlook table-level semantics, which can hurt both ranking quality and efficiency. To overcome this, the authors propose TACTUS, a table-centric method for table union search. In the offline phase, TACTUS learns table embeddings by constructing positive table pairs, sampling negative table pairs, and applying attentive table encoding to model table-level unionability. During online search, it first retrieves a compact candidate set of tables using table-level scores and then reranks those candidates by combining table-level and column-level evidence. Experiments on real-world datasets show that TACTUS significantly improves result quality while being much faster than existing methods in both offline and online processing, often by an order of magnitude.
📝 Abstract
In data lakes, information on the same subject is often fragmented across multiple tables. Table union search aims to find the top-k tables that can be unioned with a query table to extend it with more rows, without relying on metadata or ground-truth labels. Existing methods are mainly column-centric: they focus on modeling column unionability scores using column embeddings, which are then used throughout the search process for column matching, filtering, and aggregation. However, this overlooks holistic table-level semantics, which may result in suboptimal rankings and inefficiencies. We introduce TACTUS, a novel table-centric method for table union search. Unlike prior work that searches from columns to tables, we search in a table-first way and examine columns only in the final step. During offline processing, we directly generate table embeddings for holistic, table-level unionability scoring by designing table-level representation techniques, including positive table pair construction to simulate unionable tables, two-pronged negative table sampling to avoid latent positives and mine hard negatives to enhance representation quality, and attentive table encoding for effective embeddings. During online search, we first develop a table-centric adaptive candidate retrieval method that efficiently selects a compact, high-quality candidate pool by leveraging the distribution of table-level unionability scores induced by table embeddings. We then inspect columns only within this compact candidate set and design a dual-evidence reranking technique that integrates table-level and column-level scores to refine the final top-k results. Extensive experiments on real-world datasets show that TACTUS significantly improves result quality while being much faster than existing methods in both offline and online processing, often by an order of magnitude.
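The table-first online search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the adaptive cutoff rule, the column matching scheme, or how the two scores are combined, so the mean-plus-`z`-standard-deviations cutoff, the best-match column score, the weighted sum with `alpha`, and all function names below are assumptions made for illustration only. Embeddings are taken as given plain lists of floats.

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def retrieve_candidates(query_vec, table_vecs, k, z=1.0):
    """Table-level retrieval with a score-distribution cutoff.

    Assumption: the cutoff is mean + z * std of the table-level scores;
    the paper's actual adaptive rule is not given in the abstract.
    """
    scores = {name: cosine(query_vec, vec) for name, vec in table_vecs.items()}
    cutoff = mean(scores.values()) + z * pstdev(scores.values())
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = [t for t in ranked if scores[t] >= cutoff]
    if len(keep) < k:  # always keep at least k candidates
        keep = ranked[:k]
    return {t: scores[t] for t in keep}


def column_score(query_cols, cand_cols):
    """Assumed column-level evidence: mean best-match cosine per query column."""
    if not query_cols or not cand_cols:
        return 0.0
    return mean(max(cosine(q, c) for c in cand_cols) for q in query_cols)


def search(query_vec, query_cols, table_vecs, table_cols, k, alpha=0.5):
    """Retrieve by table score, then rerank with both kinds of evidence.

    Columns are inspected only for tables that survive retrieval.
    Assumption: the combined score is alpha * table + (1 - alpha) * column.
    """
    candidates = retrieve_candidates(query_vec, table_vecs, k)
    combined = {
        t: alpha * s + (1 - alpha) * column_score(query_cols, table_cols[t])
        for t, s in candidates.items()
    }
    return sorted(combined, key=combined.get, reverse=True)[:k]
```

In this sketch, column embeddings are touched only inside `search`, after `retrieve_candidates` has narrowed the pool, which mirrors the "examine columns only in the final step" ordering the abstract describes.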