Nexus: Inferring Join Graphs from Metadata Alone via Iterative Low-Rank Matrix Completion

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the problem of inferring table join relationships (that is, constructing join graphs) in enterprise environments where only metadata is accessible. An empirical study of real-world schemas shows that join graph adjacency matrices are both highly sparse and low-rank. Building on this observation, the authors propose Nexus, an end-to-end approach to join graph inference that relies solely on metadata. Nexus formulates the problem as low-rank matrix completion and uses an Expectation-Maximization algorithm that alternates between matrix completion and refining join candidate probabilities with large language models. Experiments on four datasets, including one from a real production environment, show that Nexus outperforms existing methods by a significant margin. Nexus also offers a fast mode that delivers comparable results with up to a 6x speedup.
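To make the sparsity and low-rank observation concrete, here is a minimal sketch on a toy star schema. The schema, the table-level adjacency encoding, and the helper name `star_schema_adjacency` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def star_schema_adjacency(num_dims: int) -> np.ndarray:
    """Symmetric 0/1 adjacency matrix for a hypothetical star schema.

    Table 0 is the fact table; tables 1..num_dims are dimension tables.
    An entry of 1 means the two tables are joinable.
    """
    n = num_dims + 1
    adj = np.zeros((n, n), dtype=int)
    adj[0, 1:] = 1  # fact table joins every dimension table
    adj[1:, 0] = 1  # joins are symmetric
    return adj

adj = star_schema_adjacency(num_dims=9)
density = adj.sum() / adj.size       # fraction of non-zero entries
rank = np.linalg.matrix_rank(adj)    # numerical rank via SVD
print(f"density={density:.2f}, rank={rank}")  # density=0.18, rank=2
```

In this toy case only 18 of the 100 entries are non-zero, and the matrix has rank 2 regardless of how many dimension tables are added.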

📝 Abstract
Automatically inferring join relationships is a critical task for effective data discovery, integration, querying, and reuse. However, accurately and efficiently identifying these relationships in large and complex schemas can be challenging, especially in enterprise settings where access to data values is constrained. In this paper, we introduce the problem of join graph inference when only metadata is available. We conduct an empirical study on a large number of real-world schemas and observe that join graphs, when represented as adjacency matrices, exhibit two key properties: high sparsity and low-rank structure. Based on these novel observations, we formulate join graph inference as a low-rank matrix completion problem and propose Nexus, an end-to-end solution using only metadata. To further enhance accuracy, we propose a novel Expectation-Maximization algorithm that alternates between low-rank matrix completion and refining join candidate probabilities by leveraging Large Language Models. Our extensive experiments demonstrate that Nexus outperforms existing methods by a significant margin on four datasets, including a real-world production dataset. Additionally, Nexus can operate in a fast mode, providing comparable results with up to 6x speedup, offering a practical and efficient solution for real-world deployments.
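The low-rank matrix completion idea can be sketched with a generic iterative truncated-SVD scheme (often called hard-impute). This is not the Nexus algorithm, and the LLM-based refinement step is not modeled. The toy schema, the `known` mask, the chosen rank, and the function names are all assumptions made for illustration.

```python
import numpy as np

def truncate_rank(m: np.ndarray, rank: int) -> np.ndarray:
    """Best rank-`rank` approximation of m via truncated SVD."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

def complete_low_rank(observed, known, rank, iters=200):
    """Fill unknown entries so the matrix is approximately low-rank.

    observed: matrix whose entries are trusted where `known` is True
    known:    boolean mask of trusted entries
    """
    filled = np.where(known, observed, 0.0)  # start unknowns at 0
    for _ in range(iters):
        estimate = truncate_rank(filled, rank)
        # Keep trusted entries fixed; overwrite only the unknown ones.
        filled = np.where(known, observed, estimate)
    return filled

# Toy schema (assumed): 3 fact tables (0-2) each join all 4 dimension tables (3-6).
n_fact, n_dim = 3, 4
n = n_fact + n_dim
truth = np.zeros((n, n))
truth[:n_fact, n_fact:] = 1.0
truth[n_fact:, :n_fact] = 1.0

# Pretend the join between fact table 2 and dimension table 6 is unknown.
known = np.ones((n, n), dtype=bool)
known[2, 6] = known[6, 2] = False

completed = complete_low_rank(np.where(known, truth, 0.0), known, rank=2)
print(round(float(completed[2, 6]), 3))  # score for the hidden join candidate
```

In this toy case the rank-2 constraint alone pins the hidden entry to a value close to 1, because the other fact tables all join dimension table 6. Real schemas are noisier, which is presumably why the abstract pairs completion with a separate refinement step.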
Problem

Research questions and friction points this paper is trying to address.

join graph inference
metadata-only
schema integration
data discovery
low-rank matrix completion
Innovation

Methods, ideas, or system contributions that make the work stand out.

join graph inference
low-rank matrix completion
metadata-only
Expectation-Maximization
Large Language Models