🤖 AI Summary
To address the high computational overhead of table union search under multi-vector models, which stems from the reliance on maximum bipartite matching, this paper proposes a proximity-graph-based multi-stage retrieval framework. The method filters candidates with a lightweight many-to-one matching strategy in place of exhaustive bipartite matching, and integrates a novel refinement mechanism with an enhanced pruning scheme to shrink the candidate set. By combining multi-vector embeddings, proximity graph indexing, and multi-stage filtering, the approach achieves a 3.6–6.0× speedup across six benchmark datasets while maintaining recall comparable to existing approaches. This improves both the efficiency and the scalability of semantics-driven table discovery.
📝 Abstract
Neural embedding models are extensively employed in the table union search problem, which aims to find semantically compatible tables that can be merged with a given query table. In particular, multi-vector models, which represent a table as a set of vectors (typically one vector per column), have been shown to achieve superior retrieval quality by capturing fine-grained semantic alignments. However, this setting faces more severe efficiency challenges than the single-vector setting, because computing unionability scores inherently depends on maximum bipartite matching. This paper therefore proposes an efficient Proximity Graph-based Table Union Search (PGTUS) approach. PGTUS employs a multi-stage pipeline that combines a novel refinement strategy with a filtering strategy based on many-to-one bipartite matching. In addition, we propose an enhanced pruning strategy that prunes the candidate set and further improves search efficiency. Extensive experiments on six benchmark datasets demonstrate that our approach achieves a 3.6–6.0× speedup over existing approaches while maintaining comparable recall.
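To make the cost argument concrete, the following is a minimal Python sketch of a filter-then-refine loop, not the PGTUS implementation. It assumes that the unionability score is the weight of a maximum one-to-one matching over non-negative column similarities, and that the many-to-one score lets each query column independently take its best candidate column, which makes it a cheap upper bound on the exact score. The function names, the cosine similarity, the clamping at zero, and the brute-force matcher are all illustrative assumptions; the proximity graph index and the enhanced pruning strategy are not modeled.

```python
from itertools import permutations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(query_cols, cand_cols):
    """Pairwise column similarities, clamped at 0 so every weight is non-negative."""
    return [[max(0.0, cosine(q, c)) for c in cand_cols] for q in query_cols]


def one_to_one_score(sim):
    """Exact maximum-weight bipartite matching by brute force (tiny inputs only)."""
    if not sim or not sim[0]:
        return 0.0
    if len(sim) > len(sim[0]):
        # Transpose so the rows are the smaller side of the bipartite graph.
        sim = [list(col) for col in zip(*sim)]
    n_rows, n_cols = len(sim), len(sim[0])
    return max(
        sum(sim[i][j] for i, j in enumerate(perm))
        for perm in permutations(range(n_cols), n_rows)
    )


def many_to_one_score(sim):
    """Relaxed score: each query column independently takes its best candidate column."""
    return sum(max(row) for row in sim if row)


def filter_then_refine(query_cols, candidates, k):
    """Return the top-k (name, exact_score) pairs, skipping exact matching when the bound allows."""
    bounded = []
    for name, cols in candidates.items():
        sim = similarity_matrix(query_cols, cols)
        bounded.append((many_to_one_score(sim), name, sim))
    bounded.sort(key=lambda t: t[0], reverse=True)

    top = []  # (exact_score, name), best first
    for bound, name, sim in bounded:
        if len(top) == k and bound <= top[-1][0]:
            break  # the upper bound cannot beat the current k-th best exact score
        top.append((one_to_one_score(sim), name))
        top.sort(reverse=True)
        top = top[:k]
    return [(name, score) for score, name in top]


if __name__ == "__main__":
    query = [[1, 0], [0, 1]]
    tables = {"A": [[1, 0], [0, 1]], "B": [[1, 0]], "C": [[1, 1]]}
    print(filter_then_refine(query, tables, k=1))
```

In this toy run, the exact matching is computed only for table `A`: the bounds of `C` and `B` already fall below `A`'s exact score, so both are discarded without running the expensive step. A production system would replace the brute-force matcher with a polynomial-time assignment algorithm.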