🤖 AI Summary
Current LLM-based knowledge graph construction typically extracts triples directly from text, which often yields insufficient factual coverage and fragmented relations. This work proposes a question-answering-driven semantic scaffolding mechanism: before triple extraction, it generates 5W1H-guided question-answer pairs that explicitly model contextual dependencies and implicit relationships, unfolding document semantics in a structured way. Using question-answer pairs as an intermediate representation, a strategy the work presents as novel, mitigates the trade-off between coverage and connectivity. Evaluated on the MINE benchmark, the approach improves fact retention and graph cohesion, and it maintains high coherence even as the volume of extracted knowledge expands substantially.
📝 Abstract
Constructing Knowledge Graphs (KGs) from unstructured text provides a structured framework for knowledge representation and reasoning, yet current LLM-based approaches struggle with a fundamental trade-off: factual coverage often leads to relational fragmentation, while premature consolidation causes information loss. To address this, we propose SocraticKG, an automated KG construction method that introduces question-answer pairs as a structured intermediate representation to systematically unfold document-level semantics prior to triple extraction. By employing 5W1H-guided QA expansion, SocraticKG captures contextual dependencies and implicit relational links typically lost in direct KG extraction pipelines, providing explicit grounding in the source document that helps mitigate implicit reasoning errors. Evaluation on the MINE benchmark demonstrates that our approach effectively addresses the coverage-connectivity trade-off, achieving superior factual retention while maintaining high structural cohesion even as extracted knowledge volume substantially expands. These results highlight that QA-mediated semantic scaffolding plays a critical role in structuring semantics prior to KG extraction, enabling more coherent and reliable graph construction in subsequent stages.
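To make the two-stage idea concrete, here is a minimal Python sketch of a QA-mediated pipeline: 5W1H-guided QA expansion first, then triple extraction from each QA pair. It is an illustration under stated assumptions, not the SocraticKG implementation. The prompt wording, the `Q: ... | A: ...` and `subject | relation | object` line formats, the `llm` callable, and every function name are hypothetical. `largest_component_fraction` is only a simple proxy for structural cohesion and is not claimed to be the metric used on MINE.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

FIVE_W_ONE_H = ("who", "what", "when", "where", "why", "how")

QAPair = Tuple[str, str]
Triple = Tuple[str, str, str]
LLM = Callable[[str], str]  # hypothetical: takes a prompt, returns raw text


def expand_to_qa(document: str, llm: LLM) -> List[QAPair]:
    """Stage 1 (assumed format): one prompt per 5W1H category."""
    pairs: List[QAPair] = []
    for category in FIVE_W_ONE_H:
        prompt = (
            f"List '{category}' questions answerable from the text, one per "
            f"line as 'Q: <question> | A: <answer>'.\n\nText:\n{document}"
        )
        for line in llm(prompt).splitlines():
            # Keep only lines that follow the assumed format.
            if line.startswith("Q:") and "| A:" in line:
                question, answer = line[2:].split("| A:", 1)
                pairs.append((question.strip(), answer.strip()))
    return pairs


def extract_triples(pairs: List[QAPair], llm: LLM) -> List[Triple]:
    """Stage 2 (assumed format): extract triples from each QA pair."""
    triples: List[Triple] = []
    for question, answer in pairs:
        prompt = (
            "Extract triples as 'subject | relation | object', one per line."
            f"\n\nQ: {question}\nA: {answer}"
        )
        for line in llm(prompt).splitlines():
            parts = [p.strip() for p in line.split("|")]
            if len(parts) == 3 and all(parts):
                triples.append((parts[0], parts[1], parts[2]))
    return list(dict.fromkeys(triples))  # de-duplicate, keep first-seen order


def build_kg(document: str, llm: LLM) -> List[Triple]:
    return extract_triples(expand_to_qa(document, llm), llm)


def largest_component_fraction(triples: List[Triple]) -> float:
    """Share of entities in the largest connected component (cohesion proxy)."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for subj, _, obj in triples:
        parent[find(subj)] = find(obj)  # union the two endpoints
    if not parent:
        return 0.0
    sizes = Counter(find(node) for node in list(parent))
    return max(sizes.values()) / len(parent)
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client, so a deterministic stub can stand in for the model when checking the parsing logic. Routing every triple through a QA pair is what ties each extracted fact to an explicit question and answer drawn from the source document.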