ODKE+: Ontology-Guided Open-Domain Knowledge Extraction with LLMs

📅 2025-09-04
🤖 AI Summary
Knowledge graphs (KGs) suffer from high maintenance costs and incomplete coverage. This paper proposes a production-grade, open-domain knowledge extraction system that integrates ontology-guided large language model (LLM) prompting, dynamic entity-type-aware ontology fragment generation, synergistic schema-rule-and-LLM-based extraction, lightweight secondary-LLM verification, and knowledge alignment and normalization. The system supports both batch and streaming ingestion, extracting and injecting 19 million high-confidence facts from over 9 million Wikipedia pages, achieving 98.8% precision and significantly improving KG coverage. It attains up to 48% overlap with third-party KGs and reduces average update latency by 50 days. Key contributions include: (1) a hybrid extraction paradigm under dynamically constrained ontologies, and (2) a scalable, high-fidelity end-to-end knowledge injection pipeline.

๐Ÿ“ Abstract
Knowledge graphs (KGs) are foundational to many AI applications, but maintaining their freshness and completeness remains costly. We present ODKE+, a production-grade system that automatically extracts and ingests millions of open-domain facts from web sources with high precision. ODKE+ combines modular components into a scalable pipeline: (1) the Extraction Initiator detects missing or stale facts, (2) the Evidence Retriever collects supporting documents, (3) hybrid Knowledge Extractors apply both pattern-based rules and ontology-guided prompting for large language models (LLMs), (4) a lightweight Grounder validates extracted facts using a second LLM, and (5) the Corroborator ranks and normalizes candidate facts for ingestion. ODKE+ dynamically generates ontology snippets tailored to each entity type to align extractions with schema constraints, enabling scalable, type-consistent fact extraction across 195 predicates. The system supports batch and streaming modes, processing over 9 million Wikipedia pages and ingesting 19 million high-confidence facts with 98.8% precision. ODKE+ significantly improves coverage over traditional methods, achieving up to 48% overlap with third-party KGs and reducing update lag by 50 days on average. Our deployment demonstrates that LLM-based extraction, grounded in ontological structure and verification workflows, can deliver trustworthy, production-scale knowledge ingestion with broad real-world applicability. A recording of the system demonstration is included with the submission and is also available at https://youtu.be/UcnE3_GsTWs.
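The abstract's central idea, inlining a type-specific ontology fragment into the extraction prompt so that LLM outputs stay schema-consistent, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy ontology, the predicate names, and the prompt wording are all assumptions.

```python
# Illustrative sketch of ontology-guided prompting: only the ontology fragment
# matching the entity's type is inlined into the prompt, constraining which
# predicates the LLM may emit. Toy data; not the paper's actual ontology.

# Hypothetical ontology fragment: allowed predicates per entity type.
ONTOLOGY = {
    "Person": {"birthDate": "xsd:date", "birthPlace": "City", "occupation": "Occupation"},
    "Company": {"foundedDate": "xsd:date", "headquarters": "City", "industry": "Industry"},
}

def ontology_snippet(entity_type: str) -> str:
    """Render the type-specific ontology fragment as prompt text."""
    preds = ONTOLOGY.get(entity_type, {})
    return "\n".join(f"- {p} (range: {r})" for p, r in sorted(preds.items()))

def build_prompt(entity: str, entity_type: str, evidence: str) -> str:
    """Assemble an extraction prompt constrained by the ontology fragment."""
    return (
        f"Extract facts about {entity} (type: {entity_type}).\n"
        f"Use only these predicates:\n{ontology_snippet(entity_type)}\n"
        f"Evidence:\n{evidence}\n"
        "Return one 'predicate: value' line per fact."
    )

prompt = build_prompt(
    "Ada Lovelace", "Person",
    "Ada Lovelace was born on 10 December 1815 in London.",
)
print(prompt)
```

Because the allowed-predicate list is regenerated per entity type, the same prompt template scales across many predicates without hand-writing a prompt per relation.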
Problem

Research questions and friction points this paper is trying to address.

Automating scalable extraction of open-domain facts from web sources
Maintaining knowledge graph freshness and completeness efficiently
Ensuring high-precision ontology-aligned fact extraction using LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ontology-guided prompting for LLMs
Hybrid extraction with pattern-based rules
Lightweight grounding with second LLM validation
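The grounding step above, accepting an extracted fact only when a second, lightweight check finds support for it in the retrieved evidence, can be sketched like this. The paper uses a second LLM as the verifier; here a simple string-containment stub stands in for the model call so the sketch stays runnable, and all names are illustrative.

```python
# Minimal sketch of lightweight grounding: a candidate fact is kept only if a
# secondary verifier judges it supported by the evidence. The real system uses
# a second LLM; this containment check is a stand-in so the example runs.

def verifier_llm(claim: str, evidence: str) -> bool:
    """Stub for the secondary verification LLM (assumed interface)."""
    return claim.lower() in evidence.lower()

def ground(fact: tuple[str, str, str], evidence: str) -> bool:
    """Accept a (subject, predicate, object) fact only if grounded in evidence."""
    _subject, _predicate, obj = fact
    return verifier_llm(obj, evidence)

evidence = "Ada Lovelace was born on 10 December 1815 in London."
print(ground(("Ada Lovelace", "birthPlace", "London"), evidence))  # True
print(ground(("Ada Lovelace", "birthPlace", "Paris"), evidence))   # False
```

Separating extraction from verification lets a cheaper model filter candidates before ingestion, which is how a pipeline like this can sustain high precision at scale.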
Samira Khorshidi
ML Research Engineer, Apple Inc.
NLP, Knowledge Graph, GAN, Point processes, Adversarial learning
Azadeh Nikfarjam
Apple Inc.
Suprita Shankar
Apple Inc.
Yisi Sang
ML Engineer, Apple
Human Computer Interaction, Psychometrics, NLP
Yash Govind
Apple Inc.
Hyun Jang
Apple Inc.
Ali Kasgari
Apple Inc.
Alexis McClimans
Apple Inc.
Mohamed Soliman
Junior Professor, Paderborn University
Software Architecture, Software Engineering
Vishnu Konda
Apple Inc.
Ahmed Fakhry
Apple Inc.
Xiaoguang Qi
Apple Inc.