WikiDBGraph: Large-Scale Database Graph of Wikidata for Collaborative Learning

📅 2025-05-22

🤖 AI Summary

Problem: Existing tabular learning research is constrained by isolated databases and the absence of realistic cross-database associations, which hinders progress in federated learning, transfer learning, and tabular foundation models. Method: The paper introduces WikiDBGraph, described as the first large-scale, real-world database graph. It comprises 100K databases derived from Wikidata, 17M weighted edges, and 25 properties (13 node, 12 edge) derived from database schema and data distribution. Instance- and feature-level overlaps across databases are modeled via schema parsing, distributional statistics, and learned edge weighting. Contribution/Results: WikiDBGraph addresses the shortage of interconnected tabular resources and supports several collaborative learning paradigms. Experiments on the newly identified overlapping databases show that cross-database training improves downstream task performance. The authors position the graph as a scalable training resource and a new benchmark for structured-data foundation models.

📝 Abstract

Tabular data, ubiquitous and rich in informational value, is an increasing focus for deep representation learning, yet progress is hindered by studies centered on single tables or isolated databases, which limits model capabilities because of restricted data scale. While collaborative learning approaches such as federated learning, transfer learning, split learning, and tabular foundation models aim to learn from multiple correlated databases, they are challenged by a scarcity of real-world interconnected tabular resources. Current data lakes and corpora largely consist of isolated databases lacking defined inter-database correlations. To overcome this, we introduce WikiDBGraph, a large-scale graph of 100,000 real-world tabular databases from Wikidata, interconnected by 17 million edges and characterized by 13 node and 12 edge properties derived from its database schema and data distribution. WikiDBGraph's weighted edges identify both instance- and feature-overlapped databases. Experiments on these newly identified databases confirm that collaborative learning yields superior performance, thereby offering considerable promise for structured foundation model training while also exposing key challenges and future directions for learning from interconnected tabular data.
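
To make the distinction between feature overlap and instance overlap concrete, here is a minimal sketch. It assumes Jaccard similarity over column names and over row keys as a stand-in measure; the abstract does not say how WikiDBGraph actually scores overlap. The toy databases `db_a` and `db_b`, their columns, and their keys are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two iterables, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Two hypothetical databases, each reduced to one table's column names and row keys.
db_a = {
    "columns": ["id", "name", "birth_date", "country"],
    "keys": ["Q1", "Q2", "Q3", "Q4"],
}
db_b = {
    "columns": ["id", "name", "country", "population"],
    "keys": ["Q3", "Q4", "Q5", "Q6"],
}

# Feature overlap: how many columns the two databases share (assumed measure).
feature_overlap = jaccard(db_a["columns"], db_b["columns"])

# Instance overlap: how many rows (identified by key) the two databases share.
instance_overlap = jaccard(db_a["keys"], db_b["keys"])

print(f"feature overlap:  {feature_overlap:.2f}")
print(f"instance overlap: {instance_overlap:.2f}")
```

In this toy case the databases share three of five distinct columns and two of six distinct keys. Roughly speaking, high feature overlap with disjoint rows suits horizontal federated learning, while high instance overlap with different columns suits vertical settings such as split learning.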

Problem

Research questions and friction points this paper is trying to address.

Lack of interconnected tabular datasets hinders collaborative learning models

Current data lakes lack defined inter-database correlations for structured learning

Need for large-scale real-world database graphs to enhance model training

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale graph of interconnected Wikidata databases

17 million edges, with 13 node properties and 12 edge properties

Weighted edges identify overlapping databases
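
As an illustration of how weighted edges might be used to pick collaboration partners for a given database, the sketch below builds a small undirected weighted graph and keeps only neighbors above a similarity threshold. The database identifiers, edge weights, threshold value, and helper names (`build_graph`, `neighbors_above`) are all hypothetical and are not taken from the WikiDBGraph release.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected weighted adjacency map from (u, v, weight) triples."""
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    return adj


def neighbors_above(adj, node, threshold):
    """Return (neighbor, weight) pairs with weight >= threshold, strongest first."""
    hits = [(v, w) for v, w in adj.get(node, {}).items() if w >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical edges: (database, database, similarity weight).
edges = [
    ("db_001", "db_002", 0.97),
    ("db_001", "db_003", 0.42),
    ("db_002", "db_004", 0.88),
]

graph = build_graph(edges)

# Candidate partners for db_001 under an assumed threshold of 0.9.
candidates = neighbors_above(graph, "db_001", 0.9)
print(candidates)
```

A plain adjacency map is enough at this scale. For a graph with millions of edges, a sparse matrix or a dedicated graph library would be a more practical choice.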
