KGpipe: Generation and Evaluation of Pipelines for Data Integration into Knowledge Graphs

📅 2025-11-23
🤖 AI Summary
Problem: Building high-quality knowledge graphs (KGs) from heterogeneous data sources lacks support for reproducible end-to-end pipelines. Method: The paper proposes KGpipe, a modular framework for defining and executing KG integration pipelines that span information extraction, data transformation, ontology mapping, entity matching, and data fusion. Pipelines can combine existing tools with large language model (LLM) functionality and accept inputs in multiple formats (RDF, JSON, plain text). Contribution/Results: To evaluate pipelines and the KGs they produce, the authors propose a benchmark that integrates heterogeneous data of different formats into a seed KG. They demonstrate KGpipe's flexibility by running and comparing several pipelines over sources of the same or different formats, using selected performance and quality metrics.

📝 Abstract
Building high-quality knowledge graphs (KGs) from diverse sources requires combining methods for information extraction, data transformation, ontology mapping, entity matching, and data fusion. Numerous methods and tools exist for each of these tasks, but support for combining them into reproducible and effective end-to-end pipelines is still lacking. We present a new framework, KGpipe, for defining and executing integration pipelines that can combine existing tools or large language model (LLM) functionality. To evaluate different pipelines and the resulting KGs, we propose a benchmark to integrate heterogeneous data of different formats (RDF, JSON, text) into a seed KG. We demonstrate the flexibility of KGpipe by running and comparatively evaluating several pipelines integrating sources of the same or different formats using selected performance and quality metrics.

Problem

Research questions and friction points this paper is trying to address.

Combining diverse methods for building knowledge graphs from multiple sources
Lacking support for reproducible end-to-end integration pipelines
Evaluating pipelines that integrate heterogeneous data formats into KGs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Framework for defining reproducible KG integration pipelines
Combining existing tools with LLM functionality
Benchmark for evaluating pipelines with heterogeneous data
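
The idea of chaining integration stages into one reproducible pipeline can be illustrated with a minimal sketch. This is not KGpipe's actual API, which this summary does not describe. Every name here (`extract`, `map_ontology`, `match_entities`, `fuse`, `run_pipeline`, the `kg:` and `src:` identifiers, and the predicate mapping) is hypothetical, and exact label matching stands in for a real entity matcher or LLM call.

```python
from typing import Any, Callable, Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
State = Dict[str, Any]
Stage = Callable[[State], State]


def extract(state: State) -> State:
    """Turn JSON-like records into (subject, predicate, object) triples."""
    triples: Set[Triple] = set()
    for rec in state["records"]:
        for key, value in rec.items():
            if key != "id":
                triples.add((rec["id"], key, str(value)))
    return {**state, "triples": triples}


def map_ontology(state: State) -> State:
    """Rename source keys to target predicates (hypothetical mapping)."""
    mapping = state["predicate_map"]
    triples = {(s, mapping.get(p, p), o) for s, p, o in state["triples"]}
    return {**state, "triples": triples}


def match_entities(state: State) -> State:
    """Rewrite source IDs to seed IDs when labels match case-insensitively."""
    seed_by_label = {
        o.lower(): s for s, p, o in state["seed"] if p == "rdfs:label"
    }
    same_as = {
        s: seed_by_label[o.lower()]
        for s, p, o in state["triples"]
        if p == "rdfs:label" and o.lower() in seed_by_label
    }
    triples = {(same_as.get(s, s), p, o) for s, p, o in state["triples"]}
    return {**state, "triples": triples}


def fuse(state: State) -> State:
    """Naive fusion: union of seed triples and integrated triples."""
    return {**state, "kg": state["seed"] | state["triples"]}


def run_pipeline(stages: List[Stage], state: State) -> State:
    """Run the stages in order, passing the state from one to the next."""
    for stage in stages:
        state = stage(state)
    return state


# Toy inputs (invented for illustration only).
initial_state: State = {
    "seed": {("kg:Q1", "rdfs:label", "Inception")},
    "records": [{"id": "src:42", "title": "inception", "year": 2010}],
    "predicate_map": {"title": "rdfs:label", "year": "schema:datePublished"},
}

result = run_pipeline([extract, map_ontology, match_entities, fuse], initial_state)
```

Because each stage is a plain function over a shared state, one stage (for example the matcher) can be swapped for another tool or an LLM-backed variant without changing the rest of the list, which is the kind of configurability the summary attributes to the framework.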