🤖 AI Summary
This work addresses the high cost and time-intensive nature of constructing high-quality evaluation datasets for large language models by leveraging external knowledge sources, such as Wikipedia and Wikidata, to build topic-specific knowledge graphs. These graphs serve as compressed, reusable representations that drive the generation of multiple-choice questions without repeated input of the original source texts. The proposed framework integrates large language models, knowledge graphs, retrieval-augmented generation (RAG), and a difficulty calibration mechanism, enabling efficient, domain-agnostic question generation with controllable complexity, including multi-hop reasoning. Evaluated across history, biology, and mathematics, the approach produces six high-quality multiple-choice datasets that score well on fluency, unambiguity, topic relevance, option uniqueness, and answerability, with model rankings closely aligned with the MMLU benchmark.
📝 Abstract
Large language models (LLMs) have become instrumental in applications such as Retrieval-Augmented Generation (RAG), yet evaluating these systems remains bottlenecked by the time and cost of building specialized assessment datasets. We introduce KNIGHT, an LLM-based, knowledge-graph-driven framework for generating multiple-choice question (MCQ) datasets from external sources. KNIGHT constructs a topic-specific knowledge graph (a structured, parsimonious summary of entities and relations) that can be reused to generate questions at instructor-controlled difficulty levels, including multi-hop questions, without repeatedly re-feeding the full source text. The knowledge graph acts as a compressed, reusable state, making question generation a cheap read over the graph. We instantiate KNIGHT on Wikipedia/Wikidata while keeping the framework domain- and ontology-agnostic. As a case study, KNIGHT produces six MCQ datasets in History, Biology, and Mathematics. We evaluate quality on five criteria: fluency, unambiguity (a single correct answer), topic relevance, option uniqueness, and answerability given the provided sources (a proxy for hallucination). Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-specific and difficulty-controlled evaluation.
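To make the "compressed, reusable state" idea concrete, here is a minimal sketch of multi-hop MCQ generation over a triple-based graph. This is not the authors' implementation: the toy triples, the path-sampling heuristic, and all function names (`sample_multi_hop_path`, `make_mcq`) are illustrative assumptions. It shows only the structural point that, once the graph is built, a question (with graph-grounded distractors) is a cheap read over the graph rather than another pass over the source text, and that difficulty can be dialed via the number of hops.

```python
import random

# Toy knowledge graph as (subject, relation, object) triples -- a stand-in
# for the topic-specific graph extracted from sources like Wikipedia/Wikidata.
TRIPLES = [
    ("Charles Darwin", "authored", "On the Origin of Species"),
    ("On the Origin of Species", "published_in", "1859"),
    ("Charles Darwin", "born_in", "Shrewsbury"),
    ("Gregor Mendel", "studied", "pea plants"),
]

def neighbors(graph, node):
    """Outgoing edges (relation, object) from a node."""
    return [(rel, obj) for subj, rel, obj in graph if subj == node]

def sample_multi_hop_path(graph, start, hops, rng):
    """Walk up to `hops` edges from `start`; the path drives a multi-hop question."""
    path, node = [], start
    for _ in range(hops):
        edges = neighbors(graph, node)
        if not edges:
            break
        rel, obj = rng.choice(edges)
        path.append((node, rel, obj))
        node = obj
    return path

def make_mcq(graph, start, hops=2, num_options=4, seed=0):
    """Turn a sampled multi-hop path into a (question, options, answer) MCQ.

    Distractors are drawn from other objects in the same graph, so every
    option is grounded in the source graph rather than hallucinated.
    """
    rng = random.Random(seed)
    path = sample_multi_hop_path(graph, start, hops, rng)
    answer = path[-1][2]
    chain = ", which ".join(f"{rel.replace('_', ' ')} what" for _, rel, _ in path)
    question = f"Starting from {start}: {chain}?"
    distractor_pool = sorted({obj for _, _, obj in graph if obj != answer})
    options = [answer] + rng.sample(distractor_pool, num_options - 1)
    rng.shuffle(options)
    return question, options, answer
```

In a full system the question stem and distractors would be verbalized and calibrated by an LLM; the sketch keeps only the graph-side mechanics: the graph is built once, then any number of questions at varying hop depths are read off it.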