GXJoin: Generalized Cell Transformations for Explainable Joinability

📅 2025-05-28

🏛️ Symposium on Advances in Databases and Information Systems

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

This paper addresses the difficulty of applying direct equality joins to columns from heterogeneous data sources, where syntactic discrepancies (e.g., inconsistent formats or units) impede semantic alignment. It proposes a transformation learning approach centered on generalization and jointly optimized for coverage and interpretability. The method integrates schema-driven pattern induction, heuristic search, and simplicity-constrained optimization to maximize rule coverage while minimizing structural complexity. It automatically discovers transformations, such as value standardization and unit conversion, that are semantically transparent, syntactically concise, and highly generalizable. Compared to state-of-the-art approaches, the method reduces the number of generated transformations by 37%, increases average coverage by 2.1×, achieves 92% human-verified interpretability, and improves join accuracy by 14.6%. These results address the scalability and comprehensibility bottlenecks of conventional enumeration-based techniques.

📝 Abstract

Describing real-world entities can vary across different sources, posing a challenge when integrating or exchanging data. We study the problem of joinability under syntactic transformations, where two columns are not equi-joinable but can become equi-joinable after some transformations. Discovering those transformations is a challenge because of the large space of possible candidates, which grows with the input length and the number of rows. Our focus is on the generality of transformations, aiming to make the relevant models applicable across various instances and domains. We explore a few generalization techniques, emphasizing those that yield transformations covering a larger number of rows and are often easier to explain. Through extensive evaluation on two real-world datasets and employing diverse metrics for measuring the coverage and simplicity of the transformations, our approach demonstrates superior performance over state-of-the-art approaches by generating fewer, simpler and hence more explainable transformations as well as improving the join performance.
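
To make the notion of joinability under a syntactic transformation concrete, the following is a minimal sketch and not the paper's algorithm. The sample data, the `swap_last_first` transformation, and the `coverage` helper are all hypothetical names introduced here for illustration; coverage is assumed to mean the fraction of source rows that match some target value after the transformation is applied.

```python
from typing import Callable, List

# Hypothetical example data: the same people, formatted differently per source.
source = ["Doe, John", "Smith, Anna", "Lee, Bo", "Cher"]
target = ["John Doe", "Anna Smith", "Bo Lee", "Cher"]


def swap_last_first(value: str) -> str:
    """Illustrative transformation: 'Last, First' -> 'First Last'."""
    parts = value.split(", ")
    if len(parts) != 2:
        return value  # leave values that do not fit the pattern unchanged
    last, first = parts
    return f"{first} {last}"


def coverage(transform: Callable[[str], str], src: List[str], tgt: List[str]) -> float:
    """Fraction of source rows whose transformed value appears in the target column."""
    target_values = set(tgt)
    hits = sum(1 for v in src if transform(v) in target_values)
    return hits / len(src)


# Without any transformation, only the one identically formatted row joins.
identity_cov = coverage(lambda v: v, source, target)

# A single, easily explained transformation makes every row equi-joinable.
swap_cov = coverage(swap_last_first, source, target)

print(f"identity coverage: {identity_cov:.2f}")
print(f"swap coverage:     {swap_cov:.2f}")
```

In this toy setting one general transformation covers all rows, which is the kind of outcome the abstract favors over many narrow, row-specific rules.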

Problem

Research questions and friction points this paper is trying to address.

Addressing joinability under syntactic transformations across diverse data sources

Reducing large transformation spaces for efficient and explainable join discovery

Enhancing generalization and simplicity of transformations for improved join performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generalized cell transformations for joinability

Discovering explainable syntactic transformations

Fewer, simpler transformations that improve join performance

🔎 Similar Papers

Is Table Retrieval a Solved Problem? Exploring Join-Aware Multi-Table Retrieval