🤖 AI Summary
Ontology evaluation via competency questions (CQs) is costly, error-prone, and heavily reliant on domain experts. To address this, we propose OE-Assist, a novel framework that systematically explores the use of large language models (LLMs) for automated and semi-automated CQ verification in ontology assessment. Leveraging a gold-standard dataset of 1,393 annotated CQs paired with their corresponding ontologies and ontology stories, OE-Assist uses LLMs (e.g., o1-preview and o3-mini) to judge, through natural language understanding and logical reasoning, whether an ontology can answer each CQ, and integrates this verification capability into a Protégé plugin. Experimental results show that OE-Assist performs on par with the average human evaluator, substantially reducing manual effort and error rates. Our key contributions include: (i) the first systematic investigation of LLM-driven CQ verification for ontology evaluation; (ii) a deployable semi-automated assessment tool for Protégé; and (iii) empirical evidence of the feasibility and effectiveness of LLMs for this ontology engineering task.
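The core idea is to frame CQ verification as a natural-language judgement task for the LLM: given an ontology, its story, and a CQ, the model decides whether the ontology can answer the question. As a rough illustration only, here is a minimal sketch assuming the OpenAI Python client; the prompt wording, the `verify_cq` helper, and the VERDICT-line parsing are our own assumptions for this example, not OE-Assist's actual prompts or pipeline.

```python
# Minimal sketch of LLM-based CQ verification (illustrative, not OE-Assist's
# implementation). Assumes the OpenAI Python client and an API key in the
# OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = """You are evaluating an ontology against a competency question (CQ).

Ontology (Turtle):
{ontology}

Ontology story (context):
{story}

Competency question:
{cq}

Can the ontology, as modelled, answer this CQ? Reply with a single line
starting with VERDICT: YES or VERDICT: NO, followed by a brief rationale."""

def verify_cq(ontology_ttl: str, story: str, cq: str, model: str = "o3-mini") -> bool:
    """Ask the LLM whether the ontology can answer the CQ; True means verified."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(ontology=ontology_ttl, story=story, cq=cq),
        }],
    )
    answer = response.choices[0].message.content or ""
    # Hypothetical convention: treat a leading "VERDICT: YES" as a positive verdict.
    return answer.strip().upper().startswith("VERDICT: YES")

if __name__ == "__main__":
    ttl = """@prefix ex: <http://example.org/> .
ex:Pizza a ex:Food .
ex:hasTopping a ex:Property ."""
    print(verify_cq(ttl, "A small pizza ontology.", "Which toppings does a pizza have?"))
```

In a semi-automated setting such as the Protégé plugin described above, a verdict like this would be surfaced as a suggestion for the ontology engineer to accept or override, rather than as a final judgement.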
📝 Abstract
Ontology evaluation through functional requirements, such as testing via competency question (CQ) verification, is a well-established yet costly, labour-intensive, and error-prone endeavour, even for ontology engineering experts. In this work, we introduce OE-Assist, a novel framework designed to assist ontology evaluation through automated and semi-automated CQ verification. Presenting and leveraging a dataset of 1,393 CQs paired with their corresponding ontologies and ontology stories, we offer, to our knowledge, the first systematic investigation into large language model (LLM)-assisted ontology evaluation. Our contributions include: (i) evaluating the effectiveness of an LLM-based approach for automatically performing CQ verification against a manually created gold standard, and (ii) developing and assessing an LLM-powered framework that assists CQ verification within Protégé by providing suggestions. We found that automated LLM-based evaluation with o1-preview and o3-mini performs at a level similar to the average user's performance.