🤖 AI Summary
This study addresses the problems of world-knowledge inconsistency and systematic knowledge gaps in large language models (LLMs). To this end, we propose KonTest, a knowledge-graph-based automated consistency-testing framework. Methodologically, we introduce a consistency-verification paradigm that combines semantically equivalent querying, metamorphic testing, and ontological oracles; additionally, we design a weighted LLM ensemble mechanism to mitigate knowledge omissions. Experimental evaluation across four mainstream LLMs demonstrates that KonTest produces error-inducing inputs in 19.2% of test cases and reveals an average knowledge gap of 16.5%. Our ensemble strategy reduces the knowledge gap by 32.48%. Overall, KonTest provides a scalable, interpretable, and principled framework for assessing the reliability of LLMs' factual knowledge, advancing both the rigor and transparency of LLM evaluation.
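The weighted-ensemble idea can be illustrated with a minimal sketch. This is not KonTest's implementation; the model names, answers, and weights below are illustrative assumptions (e.g., weights derived from per-model accuracy on a test suite), used only to show how weighted voting can paper over one model's knowledge gap:

```python
from collections import defaultdict

def weighted_ensemble(answers, weights):
    """Return the answer with the highest total weight across models.

    answers: maps model name -> that model's answer to one query
    weights: maps model name -> a reliability weight (hypothetical values)
    """
    score = defaultdict(float)
    for model, answer in answers.items():
        score[answer] += weights.get(model, 1.0)  # default weight 1.0
    return max(score, key=score.get)

# Illustrative example: three models agree, one has a gap/error.
answers = {"falcon": "1912", "gemini": "1912", "gpt3.5": "1911", "llama2": "1912"}
weights = {"falcon": 0.6, "gemini": 0.9, "gpt3.5": 0.7, "llama2": 0.8}
print(weighted_ensemble(answers, weights))  # -> 1912 (total weight 2.3 vs 0.7)
```

The design choice here is that a single weaker model cannot outvote the weighted majority, which is one plausible way an ensemble reduces knowledge gaps.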
📝 Abstract
In this work, we systematically expose and measure the inconsistency and knowledge gaps of Large Language Models (LLMs). Specifically, we propose an automated testing framework (called KonTest) which leverages a knowledge graph to construct test cases. KonTest probes and measures the inconsistencies in an LLM's knowledge of the world via a combination of semantically equivalent queries and test oracles (metamorphic or ontological). KonTest further mitigates knowledge gaps via a weighted LLM ensemble. Using four state-of-the-art LLMs (Falcon, Gemini, GPT3.5, and Llama2), we show that KonTest generates 19.2% error-inducing inputs (1917 errors from 9979 test inputs). It also reveals a 16.5% knowledge gap across all tested LLMs. A mitigation method informed by KonTest's test suite reduces the LLM knowledge gap by 32.48%. Our ablation study further shows that GPT3.5 is not suitable for knowledge-based consistency testing because it is only 60%-68% effective in knowledge construction.
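The core oracle described above, that semantically equivalent queries must receive the same answer, can be sketched in a few lines. This is a hedged illustration, not KonTest's code: the `ask` function below is a hypothetical stand-in for an LLM query, implemented as a lookup table with one inconsistency deliberately injected:

```python
def ask(model, prompt):
    # Hypothetical stand-in for an LLM call: a fixed lookup table
    # with one inconsistent answer injected for demonstration.
    knowledge = {
        "What is the capital of France?": "Paris",
        "Which city is the capital of France?": "Paris",
        "France's capital is which city?": "Lyon",  # injected inconsistency
    }
    return knowledge.get(prompt, "unknown")

def consistency_test(model, equivalent_prompts):
    """Metamorphic oracle: semantically equivalent prompts must yield the
    same answer; any phrasing that disagrees is an error-inducing input."""
    answers = {p: ask(model, p) for p in equivalent_prompts}
    reference = next(iter(answers.values()))  # first answer as reference
    inconsistent = [p for p, a in answers.items() if a != reference]
    return answers, inconsistent

prompts = [
    "What is the capital of France?",
    "Which city is the capital of France?",
    "France's capital is which city?",
]
answers, inconsistent = consistency_test("stub-model", prompts)
print(inconsistent)  # -> ["France's capital is which city?"]
```

No ground-truth label is needed: the oracle only checks agreement across rephrasings, which is what makes this style of testing automatable at scale.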