🤖 AI Summary
Traditional full outer join operators struggle to integrate heterogeneous, multi-source tables in data lakes because semantic approximations, such as abbreviations and synonyms, violate the strict equality assumption.
Method: This paper introduces the first data-driven fuzzy join framework that embeds fuzzy matching into the full outer join operator. Operating without strong schema constraints, it integrates similarity-aware tuple matching, lightweight feature learning, and an optimized execution engine to achieve lossless information fusion.
Contribution/Results: The framework breaks the long-standing limitation of full outer joins to exact-value matching, significantly enhancing semantic-aware integration in open, real-world data environments. Compared to state-of-the-art methods, it incurs negligible additional time overhead while improving integration coverage by 23.6% and semantic completeness by 31.4%.
📝 Abstract
Data integration is an important step in any data science pipeline where the objective is to unify the information available in different datasets for comprehensive analysis. Full Disjunction, which is an associative extension of the outer join operator, has been shown to be an effective operator for integrating datasets. It fully preserves and combines the available information. Existing Full Disjunction algorithms consider only the equi-join scenario, in which only tuples having the same value on the joining columns are integrated. This, however, does not realistically represent an open data scenario, where datasets come from diverse sources with inconsistent values (e.g., synonyms and abbreviations) and with limited metadata. Joining only on equal values therefore severely limits the ability of Full Disjunction to fully combine datasets. In this work, we propose an extension of Full Disjunction that also accounts for "fuzzy" matches among tuples. We present a novel data-driven approach to enable the joining of approximate or fuzzy matches within Full Disjunction. Experimentally, we show that fuzzy Full Disjunction does not add significant time overhead over a state-of-the-art Full Disjunction implementation and that it enhances the integration effectiveness.
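To make the contrast between equi-join and fuzzy matching concrete, here is a minimal sketch of a fuzzy full outer join over two small tables. It is not the paper's algorithm: it handles only two tables rather than a full Full Disjunction, it uses a fixed string-similarity threshold from Python's standard `difflib` instead of a data-driven matcher, and the function names, the threshold of 0.8, and the sample rows are all assumptions made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_full_outer_join(left, right, key, threshold=0.8):
    """Full outer join of two lists of dicts on a fuzzily matched key column.

    Each left row is greedily paired with its most similar unused right row,
    provided the similarity reaches `threshold`. Rows without a partner are
    kept and padded with None, so no input tuple is lost.
    """
    # Collect the union of column names, preserving first-seen order.
    columns = []
    for row in left + right:
        for col in row:
            if col not in columns:
                columns.append(col)

    used_right = set()
    result = []
    for l_row in left:
        best_idx, best_score = None, threshold
        for idx, r_row in enumerate(right):
            if idx in used_right:
                continue
            score = similarity(l_row[key], r_row[key])
            if score >= best_score:
                best_idx, best_score = idx, score
        merged = dict.fromkeys(columns)  # every column starts as None
        if best_idx is not None:
            used_right.add(best_idx)
            merged.update(right[best_idx])
        merged.update(l_row)  # the left spelling of the key wins
        result.append(merged)

    # Keep right rows that found no partner (the "outer" part on the right).
    for idx, r_row in enumerate(right):
        if idx not in used_right:
            merged = dict.fromkeys(columns)
            merged.update(r_row)
            result.append(merged)
    return result


# Illustrative rows only; the values are placeholders, not real statistics.
left = [
    {"city": "Barcelona", "population": 100},
    {"city": "Toronto", "population": 200},
]
right = [
    {"city": "Barcelonna", "country": "Spain"},  # misspelled variant
    {"city": "Berlin", "country": "Germany"},
]

joined = fuzzy_full_outer_join(left, right, key="city")
for row in joined:
    print(row)
```

With an equi-join, "Barcelona" and "Barcelonna" would stay in separate, half-empty rows. The fuzzy variant merges them while still keeping the unmatched "Toronto" and "Berlin" rows. A character-level similarity like this one catches misspellings but generally not synonyms or abbreviations, which is one reason a learned, data-driven matcher is attractive in the open data setting the abstract describes.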