KG-QAGen: A Knowledge-Graph-Based Framework for Systematic Question Generation and Long-Context LLM Evaluation

📅 2025-05-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing long-context benchmarks lack controllable complexity, hindering fine-grained evaluation of models' information extraction and reasoning capabilities. To address this, the authors propose KG-QAGen, a question-answer generation framework that couples knowledge graphs built from financial agreements with structured complexity dimensions, enabling progressive difficulty control along three axes: multi-hop retrieval, set operations, and answer plurality. The methodology comprises (1) structured parsing of financial documents, (2) domain-specific knowledge graph construction, (3) rule-guided multi-hop QA generation, and (4) formal modeling of set operations. Using this framework, the authors construct the largest long-context QA dataset to date (20,139 QA pairs) and open-source a portion of it. Empirical evaluation across 13 mainstream LLMs reveals systematic bottlenecks in set-based comparison and implicit relational reasoning, highlighting gaps in current long-context reasoning capabilities.
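The core idea — generating multi-hop questions by traversing a knowledge graph extracted from an agreement — can be illustrated with a minimal sketch. This is not the authors' code; the entities, relations, and function names are hypothetical.

```python
# Toy knowledge graph over a financial agreement, stored as (subject,
# relation, object) triples. Multi-hop questions correspond to chains of
# relations; the answer is the set of entities reached by traversal.
from collections import defaultdict

TRIPLES = [
    ("CreditAgreement1", "has_borrower", "AcmeCorp"),
    ("CreditAgreement1", "has_lender", "BankA"),
    ("CreditAgreement1", "has_lender", "BankB"),
    ("AcmeCorp", "has_guarantor", "AcmeHoldings"),
]

def build_graph(triples):
    """Index triples as an adjacency list: node -> [(relation, neighbor)]."""
    graph = defaultdict(list)
    for s, r, o in triples:
        graph[s].append((r, o))
    return graph

def answer_path(graph, start, relations):
    """Follow a chain of relations from `start`; return the answer set.
    The chain length is the question's hop count; a multi-element result
    corresponds to the 'answer plurality' axis."""
    frontier = {start}
    for rel in relations:
        frontier = {o for node in frontier
                    for (r, o) in graph[node] if r == rel}
    return frontier

graph = build_graph(TRIPLES)
# 2-hop question: "Who guarantees the borrower of CreditAgreement1?"
print(answer_path(graph, "CreditAgreement1", ["has_borrower", "has_guarantor"]))
# 1-hop question with a plural answer: "Who are the lenders?"
print(answer_path(graph, "CreditAgreement1", ["has_lender"]))
```

Difficulty is then controlled structurally: longer relation chains raise the hop count, and branching relations yield plural answer sets.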

📝 Abstract
The increasing context length of modern language models has created a need for evaluating their ability to retrieve and process information across extensive documents. While existing benchmarks test long-context capabilities, they often lack a structured way to systematically vary question complexity. We introduce KG-QAGen (Knowledge-Graph-based Question-Answer Generation), a framework that (1) extracts QA pairs at multiple complexity levels (2) by leveraging structured representations of financial agreements (3) along three key dimensions -- multi-hop retrieval, set operations, and answer plurality -- enabling fine-grained assessment of model performance across controlled difficulty levels. Using this framework, we construct a dataset of 20,139 QA pairs (the largest number among the long-context benchmarks) and open-source a part of it. We evaluate 13 proprietary and open-source LLMs and observe that even the best-performing models are struggling with set-based comparisons and multi-hop logical inference. Our analysis reveals systematic failure modes tied to semantic misinterpretation and inability to handle implicit relations.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to retrieve information from long documents
Systematically varying question complexity for fine-grained assessment
Assessing model performance on multi-hop retrieval and set operations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge-graph-based QA pair extraction
Multi-level complexity via financial agreements
Fine-grained assessment of model performance
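The set-operation axis — the one the evaluation identifies as a key failure mode — can be sketched by composing harder questions from the answer sets of simpler ones. The agreements, banks, and questions below are hypothetical, not drawn from the paper's dataset.

```python
# Answer sets of two simple one-hop questions (hypothetical data):
lenders_a = {"BankA", "BankB", "BankC"}  # "Who lends under Agreement A?"
lenders_b = {"BankB", "BankC", "BankD"}  # "Who lends under Agreement B?"

# Set operations combine them into harder comparison questions:
both = lenders_a & lenders_b    # "Which banks lend under both agreements?"
only_a = lenders_a - lenders_b  # "Which banks lend under A but not B?"
either = lenders_a | lenders_b  # "Which banks lend under either agreement?"

print(sorted(both), sorted(only_a), sorted(either))
```

Answering such questions requires first retrieving each underlying answer set from a long document and then comparing them exactly — which is where the evaluated models are reported to break down.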