🤖 AI Summary
In enterprise data lakes, column-level syntactic or semantic similarity alone is not enough to identify joinable tables; ignoring the context of the query column is a key limitation. This paper proposes TOPJoin, a context-aware, multi-criteria framework for joinable column search. It first defines context-aware column joinability, then combines column embeddings, value embeddings, set overlap measures, and query-context features through multi-criteria weighted fusion to score joinability. In experiments on one academic and one real-world enterprise join search benchmark, TOPJoin outperforms existing join search baselines on both. These results support the importance of contextual modeling for accurate joinable column discovery.
📝 Abstract
One of the major challenges in enterprise data analysis is finding joinable tables that are conceptually related and provide meaningful insights. Traditionally, joinable tables have been discovered by searching for similar columns: two columns are considered syntactically similar if their value sets overlap, and semantically similar if their column embeddings or value embeddings are close in the embedding space. However, for enterprise data lakes, column similarity alone is not sufficient to identify joinable columns and tables; the context of the query column is also important. Hence, in this work, we first define context-aware column joinability. Then we propose a multi-criteria approach, called TOPJoin, for joinable column search. We evaluate TOPJoin against existing join search baselines on one academic and one real-world join search benchmark. Our experiments show that TOPJoin outperforms the baselines on both benchmarks.
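The two traditional similarity notions described above, set overlap and embedding closeness, can be sketched in a few lines of Python. This is a minimal illustration, not TOPJoin's implementation: the function names, the example data, and the weighted average in `combined_score` are assumptions made for the example, and the embedding vectors are hand-written stand-ins for the output of a real embedding model.

```python
import math


def containment(query_values, candidate_values):
    """Syntactic similarity: fraction of distinct query values found in the candidate column."""
    q, c = set(query_values), set(candidate_values)
    return len(q & c) / len(q) if q else 0.0


def jaccard(values_a, values_b):
    """Syntactic similarity: size of the intersection over size of the union."""
    a, b = set(values_a), set(values_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(u, v):
    """Semantic similarity: cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def combined_score(criteria, weights):
    """Hypothetical multi-criteria score: weighted average of per-criterion scores.

    This fusion rule is an assumption for illustration only; the source does not
    specify how TOPJoin combines its criteria.
    """
    total = sum(weights[name] for name in criteria)
    if total == 0:
        return 0.0
    return sum(weights[name] * score for name, score in criteria.items()) / total


# Toy columns (made-up data).
query_column = ["US", "DE", "FR", "JP"]
candidate_column = ["US", "DE", "FR", "IT", "ES"]

# Stand-in embeddings; a real system would obtain these from an embedding model.
query_embedding = [1.0, 0.0]
candidate_embedding = [1.0, 0.0]

scores = {
    "overlap": containment(query_column, candidate_column),
    "embedding": cosine(query_embedding, candidate_embedding),
}
final = combined_score(scores, {"overlap": 1.0, "embedding": 1.0})
```

Here three of the four distinct query values appear in the candidate column, so the containment score is 0.75, while the Jaccard score for the same pair is 0.5 because the union holds six distinct values. Neither score reflects the context of the query column, which is the gap the abstract says context-aware joinability is meant to address.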