TOPJoin: A Context-Aware Multi-Criteria Approach for Joinable Column Search

📅 2025-07-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In enterprise data lakes, column-level syntactic or semantic similarity alone is not sufficient to identify joinable tables; the context of the query column also matters. This paper defines context-aware column joinability and proposes TOPJoin, a multi-criteria approach to joinable column search that combines signals such as set overlap, column embeddings, value embeddings, and query-column context into a single joinability score. In experiments on one academic and one real-world join search benchmark, TOPJoin performs better than existing join search baselines on both, which supports the role of context in joinable column discovery.
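The idea of combining several criteria into one ranking can be pictured with a small sketch. This is an illustration only, not the paper's method: the aggregation rule (a normalized weighted sum), the function name `rank_candidates`, the criterion names, and all weights and scores are assumptions made up for the example.

```python
from typing import Dict, List, Tuple


def rank_candidates(
    scores: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Rank candidate columns by a normalized weighted sum of criterion scores.

    `scores` maps a candidate column name to its per-criterion scores in [0, 1].
    `weights` maps each criterion name to a non-negative weight.
    Both the criteria and the weighted-sum rule are hypothetical.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    ranked = []
    for column, criteria in scores.items():
        # A criterion missing for a candidate contributes 0 to its score.
        combined = sum(
            weight * criteria.get(name, 0.0) for name, weight in weights.items()
        ) / total_weight
        ranked.append((column, combined))

    # Highest combined score first.
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Hypothetical candidates for a query column: both overlap strongly with it,
# but only the first also matches on embedding and context criteria.
example_scores = {
    "orders.customer_id": {"overlap": 0.9, "embedding": 0.8, "context": 0.7},
    "tickets.id": {"overlap": 0.9, "embedding": 0.4, "context": 0.1},
}
example_weights = {"overlap": 1.0, "embedding": 1.0, "context": 1.0}
ranking = rank_candidates(example_scores, example_weights)
```

In this toy input the two candidates tie on set overlap, so the context and embedding criteria decide the order, which is the kind of case where similarity alone would not separate them.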

📝 Abstract

One of the major challenges in enterprise data analysis is finding joinable tables that are conceptually related and provide meaningful insights. Traditionally, joinable tables have been discovered by searching for similar columns: two columns are considered syntactically similar if their value sets overlap, and semantically similar if their column embeddings or value embeddings are close in the embedding space. However, for enterprise data lakes, column similarity is not sufficient to identify joinable columns and tables; the context of the query column is also important. Hence, in this work, we first define context-aware column joinability. Then we propose a multi-criteria approach, called TOPJoin, for joinable column search. We evaluate TOPJoin against existing join search baselines on one academic and one real-world join search benchmark. Through experiments, we find that TOPJoin performs better than the baselines on both benchmarks.
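The two traditional notions of column similarity mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: containment is used as one possible set overlap measure, cosine similarity as one possible notion of closeness in embedding space, and the function names and inputs are hypothetical. The abstract does not specify which measures the baselines or TOPJoin use.

```python
import math
from typing import Iterable, Sequence


def containment(query_values: Iterable[str], candidate_values: Iterable[str]) -> float:
    """Syntactic similarity: fraction of distinct query values found in the candidate."""
    query_set = set(query_values)
    if not query_set:
        return 0.0
    return len(query_set & set(candidate_values)) / len(query_set)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Semantic similarity: cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical column values: three of the four query values appear in the candidate.
overlap = containment(["a", "b", "c", "d"], ["b", "c", "d", "e"])

# Hypothetical embeddings, assumed to come from some column or value encoder.
same_direction = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Both functions look only at the two columns being compared, which is the limitation the abstract points to: neither takes the context of the query column into account.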

Problem

Research questions and friction points this paper is trying to address.

Identifying conceptually related joinable tables in enterprise data lakes

Enhancing joinable column search with a context-aware, multi-criteria approach

Improving accuracy over traditional syntactic or semantic similarity methods

Innovation

Methods, ideas, or system contributions that make the work stand out.

Context-aware column joinability definition

Multi-criteria approach for join search

TOPJoin performs better than existing join search baselines on one academic and one real-world benchmark

🔎 Similar Papers

Is Table Retrieval a Solved Problem? Exploring Join-Aware Multi-Table Retrieval