🤖 AI Summary
Problem: Traditional knowledge graph (KG) schema construction relies heavily on manual curation by domain experts, which limits scalability and maintainability. Method: We propose the first automated KG schema generation method targeting the Shape Expressions (ShEx) formal language, using large language models (LLMs) in a multi-stage pipeline that combines local structural patterns with global semantic context from KGs. Contribution/Results: To support rigorous evaluation, we introduce two benchmark datasets, YAGO Schema and Wikidata EntitySchema, and define dedicated metrics for ShEx schema quality. Experiments on these benchmarks indicate that LLMs can produce high-quality ShEx schemas that validate, which points toward scalable, automated schema generation for large KGs. This work supports the shift from manual to LLM-driven KG schema engineering and contributes a benchmark and methodology for applying LLMs to syntactically strict formal specification languages.
📝 Abstract
Schemas are vital for ensuring data quality in the Semantic Web and natural language processing. Traditionally, their creation demands substantial involvement from knowledge engineers and domain experts. Leveraging the impressive capabilities of large language models (LLMs) in related tasks like ontology engineering, we explore automatic schema generation using LLMs. To bridge the resource gap, we introduce two datasets: YAGO Schema and Wikidata EntitySchema, along with evaluation metrics. The LLM-based pipelines effectively utilize local and global information from knowledge graphs (KGs) to generate validating schemas in Shape Expressions (ShEx). Experiments demonstrate LLMs' strong potential in producing high-quality ShEx schemas, paving the way for scalable, automated schema generation for large KGs. Furthermore, our benchmark introduces a new challenge for structured generation, pushing the limits of LLMs on syntactically rich formalisms.
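To make the notions of "local information" and a ShEx schema more concrete, the following is a minimal sketch, not the pipeline described in the abstract. It contains no LLM step; it only assumes that local information can be approximated by how often each predicate is used per subject, and it maps those counts to ShEx cardinality markers. The function names (`infer_cardinality`, `sketch_shape`), the `ex:` prefix, and the sample triples are all hypothetical.

```python
from collections import defaultdict


def infer_cardinality(counts, n_subjects):
    """Map per-subject usage counts of one predicate to a ShEx cardinality suffix.

    Assumption for illustration: a predicate used by every subject is required,
    and a predicate used more than once by any subject is repeatable.
    """
    required = len(counts) == n_subjects
    repeated = any(c > 1 for c in counts.values())
    if required and not repeated:
        return ""   # ShEx default: exactly one
    if required:
        return "+"  # one or more
    if repeated:
        return "*"  # zero or more
    return "?"      # zero or one


def sketch_shape(shape_label, triples, prefix="PREFIX ex: <http://example.org/>"):
    """Render a ShExC shape from (subject, predicate, object) triples.

    Every value constraint is the wildcard '.', since this sketch does not
    look at object datatypes or classes.
    """
    subjects = {s for s, _, _ in triples}
    usage = defaultdict(lambda: defaultdict(int))
    for s, p, _ in triples:
        usage[p][s] += 1

    constraints = []
    for p in sorted(usage):
        card = infer_cardinality(usage[p], len(subjects))
        constraints.append(f"  {p} . {card}".rstrip())

    body = " ;\n".join(constraints)
    return f"{prefix}\n\n{shape_label} {{\n{body}\n}}"


if __name__ == "__main__":
    # Hypothetical instance data for a single class.
    sample = [
        ("ex:alice", "ex:name", "Alice"),
        ("ex:alice", "ex:knows", "ex:bob"),
        ("ex:alice", "ex:knows", "ex:carol"),
        ("ex:bob", "ex:name", "Bob"),
        ("ex:bob", "ex:birthDate", "1990-01-01"),
    ]
    print(sketch_shape("<PersonShape>", sample))
```

A count-based heuristic like this overfits to the observed instances, which is one reason a pipeline might also draw on global information such as class and property descriptions when choosing constraints.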