🤖 AI Summary
Existing joinable column discovery methods perform inconsistently across data environments, and it remains unclear how multiple evaluation criteria influence their effectiveness. To address this, the paper systematically evaluates syntactic and semantic approaches on seven benchmarks covering relational databases and data lakes. It analyzes six criteria (unique values, intersection size, join size, reverse join size, value semantics, and metadata semantics), examines an ensemble ranking strategy that combines them, and offers guidelines for selecting methods based on dataset characteristics. The experiments show that ensemble approaches consistently outperform single-criterion methods; metadata and value semantics are crucial in data lake settings, whereas size-based criteria play a stronger role in relational databases. The result is an empirically grounded, reusable methodology for automated enterprise data analysis.
📝 Abstract
Joinable Column Discovery is a critical challenge in automating enterprise data analysis. While existing approaches focus on syntactic overlap and semantic similarity, there remains limited understanding of which methods perform best for different data characteristics and how multiple criteria influence discovery effectiveness. We present a comprehensive experimental evaluation of joinable column discovery methods across diverse scenarios. Our study compares syntactic and semantic techniques on seven benchmarks covering relational databases and data lakes. We analyze six key criteria -- unique values, intersection size, join size, reverse join size, value semantics, and metadata semantics -- and examine how combining them through ensemble ranking affects performance. Our analysis reveals differences in method behavior across data contexts and highlights the benefits of integrating multiple criteria for robust join discovery. We provide empirical evidence on when each criterion matters, compare pre-trained embedding models for semantic joins, and offer practical guidelines for selecting suitable methods based on dataset characteristics. Our findings show that metadata and value semantics are crucial for data lakes, size-based criteria play a stronger role in relational databases, and ensemble approaches consistently outperform single-criterion methods.
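To make the idea of combining criteria concrete, the following is a minimal sketch, not the paper's implementation. It computes three of the size-based criteria for a query column and a set of candidate columns, then merges the per-criterion rankings by summing rank positions (a Borda-style aggregation). The function names `size_criteria` and `ensemble_rank`, the exact criterion definitions, the assumption that a larger value is better for every criterion, and the choice of rank-sum aggregation are all illustrative assumptions; the abstract does not specify them. Reverse join size and the two semantic criteria are omitted because their definitions are not given here.

```python
from collections import Counter


def size_criteria(query, candidate):
    """Compute three size-based criteria for one (query, candidate) pair.

    Definitions are assumptions made for illustration only.
    """
    q, c = Counter(query), Counter(candidate)
    shared = q.keys() & c.keys()  # distinct values present in both columns
    return {
        # number of distinct values in the candidate column
        "unique_values": len(c),
        # number of distinct values the two columns share
        "intersection_size": len(shared),
        # number of rows an equi-join on these columns would produce
        "join_size": sum(q[v] * c[v] for v in shared),
    }


def ensemble_rank(query, candidates):
    """Rank candidate columns by summed per-criterion rank positions.

    candidates maps a column name to its list of values.
    Returns column names, best first. Ties keep insertion order
    because Python's sort is stable.
    """
    scores = {name: size_criteria(query, col) for name, col in candidates.items()}
    criteria = ["unique_values", "intersection_size", "join_size"]
    total_rank = {name: 0 for name in candidates}
    for crit in criteria:
        # Assumption: a higher value is better for every criterion.
        ordered = sorted(candidates, key=lambda n: scores[n][crit], reverse=True)
        for position, name in enumerate(ordered):
            total_rank[name] += position
    return sorted(candidates, key=lambda n: total_rank[n])


if __name__ == "__main__":
    query = ["a", "b", "b", "c"]
    candidates = {
        "orders": ["a", "b", "c", "d"],
        "noise": ["x", "y", "z"],
        "dup": ["a", "a", "a"],
    }
    print(ensemble_rank(query, candidates))
```

In this toy input, a candidate that shares no values with the query can still rank well on unique values alone, which is the kind of single-criterion weakness that combining rankings is meant to dampen.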