KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration

📅 2026-02-23
🤖 AI Summary
This work addresses the high cost and time-intensive nature of constructing high-quality evaluation datasets for large language models by proposing a method that leverages external knowledge sources (such as Wikipedia and Wikidata) to build topic-specific knowledge graphs. These graphs serve as compressed representations that drive the generation of multiple-choice questions without requiring repeated input of the original source texts. The framework integrates large language models, knowledge graphs, retrieval-augmented generation (RAG), and a difficulty calibration mechanism, enabling efficient, domain-agnostic question generation with controllable complexity, including multi-hop reasoning. Evaluated across history, biology, and mathematics, the approach produces six high-quality multiple-choice datasets that score highly on fluency, unambiguity, topic relevance, option uniqueness, and answerability, with model performance closely aligning with the MMLU benchmark.

📝 Abstract
Large language models (LLMs) have become instrumental in applications such as Retrieval-Augmented Generation (RAG), yet evaluating these systems remains bottlenecked by the time and cost of building specialized assessment datasets. We introduce KNIGHT, an LLM-based, knowledge-graph-driven framework for generating multiple-choice question (MCQ) datasets from external sources. KNIGHT constructs a topic-specific knowledge graph, a structured and parsimonious summary of entities and relations that can be reused to generate questions at instructor-controlled difficulty levels, including multi-hop questions, without repeatedly re-feeding the full source text. The knowledge graph acts as a compressed, reusable state, turning question generation into a cheap read over the graph. We instantiate KNIGHT on Wikipedia/Wikidata while keeping the framework domain- and ontology-agnostic. As a case study, KNIGHT produces six MCQ datasets in History, Biology, and Mathematics. We evaluate quality on five criteria: fluency, unambiguity (a single correct answer), topic relevance, option uniqueness, and answerability given the provided sources (a proxy for hallucination). Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-specific and difficulty-controlled evaluation.
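The abstract's core idea, treating a triple-based knowledge graph as a reusable state and generating multi-hop MCQs as cheap reads over it, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy triples, the `multi_hop_mcq` function, and the template-based question text are all hypothetical; in KNIGHT an LLM would verbalize the path into a fluent question, and the hop count stands in for the difficulty-calibration knob.

```python
import random

# Hypothetical toy KG as (subject, relation, object) triples.
TRIPLES = [
    ("Rome", "capital_of", "Roman Empire"),
    ("Roman Empire", "founded_in", "27 BC"),
    ("Athens", "capital_of", "Ancient Greece"),
    ("Carthage", "capital_of", "Carthaginian Empire"),
    ("Ancient Greece", "founded_in", "800 BC"),
]

def multi_hop_mcq(triples, start, hops, rng):
    """Walk `hops` edges from `start`; the endpoint is the correct answer.
    Distractors are objects of the same final relation elsewhere in the graph,
    so more hops yields harder, longer reasoning chains."""
    path, node = [], start
    for _ in range(hops):
        out = [(r, o) for s, r, o in triples if s == node]
        if not out:
            raise ValueError(f"no outgoing edge from {node}")
        rel, node = out[0]
        path.append(rel)
    answer = node
    distractors = {o for s, r, o in triples if r == path[-1] and o != answer}
    options = [answer] + sorted(distractors)
    rng.shuffle(options)
    # A real system would have an LLM verbalize the relation path into
    # fluent text; here we emit a fixed template instead.
    question = f"Starting from {start}, follow {' then '.join(path)}: what do you reach?"
    return question, options, answer

q, opts, ans = multi_hop_mcq(TRIPLES, "Rome", 2, random.Random(0))
print(q)
print(opts, "->", ans)
```

Note that the source text is needed only once, to extract the triples; every subsequent question is generated from the graph alone, which is the token-efficiency argument the abstract makes.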
Problem

Research questions and friction points this paper is trying to address.

question generation
knowledge graph
large language models
evaluation dataset
difficulty control
Innovation

Methods, ideas, or system contributions that make the work stand out.

knowledge graph
multiple-choice question generation
adaptive hardness calibration
reusable representation
LLM-based evaluation
Authors

Mohammad Amanlou, University of Tehran
Erfan Shafiee Moghaddam, Independent Researcher
Yasaman Amou Jafari, University of Tehran
Mahdi Noori, University of Tehran
Farhan Farsi, Computer Engineering Department, Amirkabir University of Technology
Behnam Bahrak, Tehran Institute for Advanced Studies