🤖 AI Summary
This work addresses the challenge of efficiently inferring table join relationships (i.e., constructing join graphs) in enterprise environments where only metadata is accessible. The study observes that the adjacency matrices of real-world join graphs are both highly sparse and low-rank. Building on this insight, the authors propose Nexus, an end-to-end approach to join graph inference that relies solely on metadata. Nexus formulates the problem as low-rank matrix completion and uses an Expectation-Maximization algorithm that alternates between matrix completion and refining join candidate probabilities with large language models. Experiments on four datasets, including one from a real production environment, show that Nexus outperforms existing methods by a significant margin; a separate fast mode delivers comparable results with up to a 6x speedup.
📝 Abstract
Automatically inferring join relationships is a critical task for effective data discovery, integration, querying, and reuse. However, accurately and efficiently identifying these relationships in large and complex schemas can be challenging, especially in enterprise settings where access to data values is constrained. In this paper, we introduce the problem of join graph inference when only metadata is available. We conduct an empirical study on a large number of real-world schemas and observe that join graphs, when represented as adjacency matrices, exhibit two key properties: high sparsity and low-rank structure. Based on these novel observations, we formulate join graph inference as a low-rank matrix completion problem and propose Nexus, an end-to-end solution using only metadata. To further enhance accuracy, we propose a novel Expectation-Maximization algorithm that alternates between low-rank matrix completion and refining join candidate probabilities by leveraging Large Language Models. Our extensive experiments demonstrate that Nexus outperforms existing methods by a significant margin on four datasets, including a real-world production dataset. Additionally, Nexus can operate in a fast mode, providing comparable results with up to 6x speedup, offering a practical and efficient solution for real-world deployments.
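
To make the low-rank completion idea concrete, here is a minimal sketch in Python with NumPy. It is not the Nexus implementation: it swaps in a generic iterative truncated-SVD imputation, and the table names, the `complete_low_rank` helper, the chosen rank, and the iteration count are all assumptions made for illustration. The toy schema has two fact tables that each join to the same four dimension tables, so its adjacency matrix has rank 2, and one table pair is treated as unlabelled.

```python
import numpy as np

def complete_low_rank(observed, mask, rank=2, n_iters=200):
    """Impute unknown entries by repeated truncated-SVD projection.

    observed: (n, n) array whose values are trusted only where mask is True.
    mask:     (n, n) boolean array, True where join / non-join is known.
    """
    filled = np.where(mask, observed, 0.0).astype(float)
    for _ in range(n_iters):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
        # Known entries stay fixed; only unknown entries take the estimate.
        filled = np.where(mask, observed, low_rank)
    return filled

# Hypothetical schema: two fact tables that each join to four dimension tables.
tables = ["orders", "returns", "customer", "product", "store", "date"]
n = len(tables)
truth = np.zeros((n, n))
truth[:2, 2:] = 1.0   # fact -> dimension joins
truth[2:, :2] = 1.0   # symmetric counterpart

# Pretend the returns-date pair has not been labelled yet.
mask = np.ones((n, n), dtype=bool)
mask[1, 5] = mask[5, 1] = False

scores = complete_low_rank(truth, mask, rank=2)
print("rank of true adjacency matrix:", np.linalg.matrix_rank(truth))
print("score for returns-date:", round(float(scores[1, 5]), 3))
```

In this toy case the hidden pair's score moves toward 1, because that is the only value that keeps the matrix at rank 2 given the known entries. The sketch covers only the completion step; the LLM-based refinement of join candidate probabilities described in the abstract is not modelled here, and in practice the rank and decision threshold would need to be chosen from data.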