QJoin: Transformation-aware Joinable Data Discovery Using Reinforcement Learning

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
In heterogeneous data repositories, identifiers are often inconsistently formatted, embedded in longer strings, or split across multiple columns, which defeats conventional equi-join and fuzzy-join discovery methods. This paper proposes QJoin, a reinforcement learning framework that learns reusable transformation strategies. It introduces a uniqueness-aware reward that balances match similarity against key distinctiveness, steering the agent toward concise transformation chains. Two reuse mechanisms speed up new join tasks: agent transfer, which initializes new policies from pretrained agents, and transformation reuse, which caches successful operator sequences for similar column clusters. On the AutoJoin Web benchmark (31 table pairs), QJoin achieves an average F1-score of 91.0%; across 19,990 join tasks on the NYC and Chicago open datasets, reuse reduces runtime by up to 7.4% (13,747 seconds). The results suggest that learning and reusing transformations can improve both the accuracy and the efficiency of join discovery.

📝 Abstract
Discovering which tables in large, heterogeneous repositories can be joined and by what transformations is a central challenge in data integration and data discovery. Traditional join discovery methods are largely designed for equi-joins, which assume that join keys match exactly or nearly so. These techniques, while efficient in clean, well-normalized databases, fail in open or federated settings where identifiers are inconsistently formatted, embedded, or split across multiple columns. Approximate or fuzzy joins alleviate minor string variations but cannot capture systematic transformations. We introduce QJoin, a reinforcement-learning framework that learns and reuses transformation strategies across join tasks. QJoin trains an agent under a uniqueness-aware reward that balances similarity with key distinctiveness, enabling it to explore concise, high-value transformation chains. To accelerate new joins, we introduce two reuse mechanisms: (i) agent transfer, which initializes new policies from pretrained agents, and (ii) transformation reuse, which caches successful operator sequences for similar column clusters. On the AutoJoin Web benchmark (31 table pairs), QJoin achieves an average F1-score of 91.0%. For 19,990 join tasks in NYC+Chicago open datasets, QJoin reduces runtime by up to 7.4% (13,747 s) through reuse. These results demonstrate that transformation learning and reuse can make join discovery both more accurate and more efficient.
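
To make the idea of a uniqueness-aware reward concrete, here is a minimal sketch. The abstract does not give the paper's reward formula, so the weighted sum below, the `alpha` parameter, and the names `apply_chain` and `uniqueness_aware_reward` are all illustrative assumptions, not QJoin's actual definition.

```python
from typing import Callable, Sequence

Op = Callable[[str], str]


def apply_chain(values: Sequence[str], chain: Sequence[Op]) -> list[str]:
    """Apply each string operator in order to every value."""
    out = list(values)
    for op in chain:
        out = [op(v) for v in out]
    return out


def uniqueness_aware_reward(
    source: Sequence[str],
    target: Sequence[str],
    chain: Sequence[Op],
    alpha: float = 0.5,  # assumed trade-off weight, not from the paper
) -> float:
    """Hypothetical reward: weighted sum of match rate and key distinctness."""
    transformed = apply_chain(source, chain)
    if not transformed:
        return 0.0
    target_set = set(target)
    # Similarity: fraction of transformed source values found in the target.
    similarity = sum(v in target_set for v in transformed) / len(transformed)
    # Uniqueness: fraction of transformed values that remain distinct.
    uniqueness = len(set(transformed)) / len(transformed)
    return alpha * similarity + (1 - alpha) * uniqueness


def take_suffix(s: str) -> str:
    """Keep only the text after the last hyphen."""
    return s.rsplit("-", 1)[-1]


# A chain that matches the target and keeps keys distinct.
good = uniqueness_aware_reward(["NYC-001", "NYC-002"], ["001", "002"], [take_suffix])
# A chain that matches the target but collapses two keys into one.
collapsed = uniqueness_aware_reward(["a-1", "b-1", "c-2"], ["1", "2"], [take_suffix])
```

In this sketch the second chain still matches every target value, but it scores lower because two source keys collapse to the same value, which is the kind of degenerate transformation a uniqueness term is meant to discourage.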
Problem

Research questions and friction points this paper is trying to address.

Discovers joinable tables with transformations in heterogeneous data repositories.
Addresses inconsistent identifier formatting in open or federated data settings.
Learns and reuses transformation strategies to improve accuracy and efficiency.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses reinforcement learning to learn transformation strategies for joins.
Introduces a uniqueness-aware reward balancing similarity and distinctiveness.
Employs agent transfer and transformation reuse for efficiency.
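
Transformation reuse, as listed above, can be sketched as a cache keyed by a coarse description of the columns involved. The summary does not say how QJoin clusters columns or represents operators, so the character-shape signature, the `OPS` table, and the `TransformationCache` class below are hypothetical stand-ins chosen only to illustrate the caching idea.

```python
import re
from collections import Counter
from typing import Optional, Sequence

# Hypothetical operator vocabulary; the paper's actual operators are not listed here.
OPS = {
    "lower": str.lower,
    "strip": str.strip,
    "digits_only": lambda s: re.sub(r"\D", "", s),
}


def column_signature(values: Sequence[str]) -> str:
    """Return the most common character shape, e.g. 'NYC-001' -> 'A-9'."""
    if not values:
        return ""

    def shape(v: str) -> str:
        v = re.sub(r"[A-Za-z]+", "A", v)  # collapse letter runs
        return re.sub(r"\d+", "9", v)     # collapse digit runs

    return Counter(shape(v) for v in values).most_common(1)[0][0]


class TransformationCache:
    """Maps (source shape, target shape) to a previously successful chain."""

    def __init__(self) -> None:
        self._chains: dict[tuple[str, str], list[str]] = {}

    def store(self, source: Sequence[str], target: Sequence[str], chain: Sequence[str]) -> None:
        key = (column_signature(source), column_signature(target))
        self._chains[key] = list(chain)

    def lookup(self, source: Sequence[str], target: Sequence[str]) -> Optional[list[str]]:
        key = (column_signature(source), column_signature(target))
        return self._chains.get(key)


def apply_named_chain(values: Sequence[str], chain: Sequence[str]) -> list[str]:
    """Apply operators from OPS, by name, in order."""
    out = list(values)
    for name in chain:
        out = [OPS[name](v) for v in out]
    return out


cache = TransformationCache()
# Record a chain that worked for one join task.
cache.store(["NYC-001", "NYC-002"], ["001", "002"], ["digits_only"])
# A new task with similarly shaped columns can reuse it instead of searching again.
reused = cache.lookup(["CHI-104", "CHI-877"], ["104", "877"])
```

A cache hit lets a new join task skip the search for a transformation chain; on a miss, the caller would fall back to learning a chain from scratch or from a transferred agent.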