🤖 AI Summary
This study evaluates how ready knowledge graphs are for identifying coverage gaps and overlaps in policy documents such as insurance contracts. It introduces an executable and auditable benchmark for this task that aligns natural-language clauses with an OWL ontology (TBox/ABox) and pairs SPARQL queries with clause-level evidence excerpts, so that each gap or overlap label can be traced back to the contract text. Compared with a text-only large language model baseline, the ontology-driven approach shows better consistency and diagnostic capability across 58 structured scenarios, underscoring the value of explicit knowledge modeling for interpretable and defensible readiness evaluations. The work also provides a reusable benchmark template to support downstream knowledge graph quality assessment.
📝 Abstract
Task-oriented evaluation of knowledge graph (KG) quality increasingly asks whether an ontology-based representation can answer the competency questions that users actually care about, in a manner that is reproducible, explainable, and traceable to evidence. This paper adopts that perspective and focuses on gap and overlap analysis for policy-like documents (e.g., insurance contracts), where the question is, given a scenario, which documents support it (overlap) and which do not (gap), with defensible justifications. The resulting gap/overlap determinations are typically driven by genuine differences in coverage and restrictions rather than missing data, making the task a direct test of KG task readiness rather than a test of missing facts or query expressiveness. We present an executable and auditable benchmark that aligns natural-language contract text with a formal ontology and evidence-linked ground truth, enabling systematic comparison of methods. The benchmark includes: (i) ten simplified yet diverse life-insurance contracts reviewed by a domain expert, (ii) a domain ontology (TBox) with an instantiated knowledge base (ABox) populated from contract facts, and (iii) 58 structured scenarios, each paired with a SPARQL query, contract-level outcomes, and clause-level excerpts that justify each label. Using this resource, we compare a text-only LLM baseline that infers outcomes directly from contract text against an ontology-driven pipeline that answers the same scenarios over the instantiated KG, demonstrating that explicit modeling improves consistency and diagnosis for gap/overlap analyses. Although demonstrated for gap and overlap analysis, the benchmark is intended as a reusable template for evaluating KG quality and supporting downstream work such as ontology learning, KG population, and evidence-grounded question answering.
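To make the gap/overlap idea concrete, the following is a minimal sketch of how a scenario might be labeled per contract with a supporting excerpt. It is not the benchmark's ontology or its SPARQL queries: the predicates `covers` and `excludes`, the contract names, the clause excerpts, and the `classify` helper are all hypothetical, and a plain Python set of triples stands in for the instantiated ABox.

```python
from dataclasses import dataclass

# Hypothetical toy ABox: (subject, predicate, object) triples.
# Predicate and entity names are illustrative, not taken from the benchmark.
TRIPLES = {
    ("contractA", "covers", "AccidentalDeath"),
    ("contractA", "excludes", "ExtremeSports"),
    ("contractB", "covers", "AccidentalDeath"),
    ("contractC", "covers", "NaturalDeath"),
}

# Hypothetical clause excerpts, keyed by the triple they justify.
EVIDENCE = {
    ("contractA", "covers", "AccidentalDeath"): "A, clause 2: accidental death is covered.",
    ("contractA", "excludes", "ExtremeSports"): "A, clause 5: extreme sports are excluded.",
    ("contractB", "covers", "AccidentalDeath"): "B, clause 3: accidental death is covered.",
    ("contractC", "covers", "NaturalDeath"): "C, clause 1: natural death is covered.",
}


@dataclass(frozen=True)
class Scenario:
    event: str         # what happened
    circumstance: str  # the context in which it happened


def classify(contract, scenario, triples=TRIPLES, evidence=EVIDENCE):
    """Return (label, excerpts) for one contract and one scenario."""
    covers = (contract, "covers", scenario.event)
    excludes = (contract, "excludes", scenario.circumstance)
    if covers not in triples:
        # No coverage asserted at all: a gap with no positive evidence to cite.
        return "gap", []
    if excludes in triples:
        # Coverage exists but a restriction blocks it: cite both clauses.
        return "gap", [evidence[covers], evidence[excludes]]
    return "overlap", [evidence[covers]]


scenario = Scenario(event="AccidentalDeath", circumstance="ExtremeSports")
results = {c: classify(c, scenario) for c in ("contractA", "contractB", "contractC")}
for contract, (label, excerpts) in results.items():
    print(contract, label, excerpts)
```

In this toy setup the two gaps arise for different reasons (an explicit exclusion versus absent coverage), which is the kind of diagnosis that the evidence excerpts are meant to make auditable.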