🤖 AI Summary
Existing long-context benchmarks lack controllable complexity design, hindering fine-grained evaluation of models' information extraction and reasoning capabilities. To address this, we propose KG-QAGen, a question-answer generation framework that integrates a knowledge graph built from financial agreements with structured complexity dimensions, enabling progressive difficulty control along three axes: multi-hop retrieval, set operations, and answer plurality. The methodology comprises (1) structured parsing of financial documents, (2) domain-specific knowledge graph construction, (3) rule-guided multi-hop QA generation, and (4) formal modeling of set logic. Using this framework, we construct a dataset of 20,139 QA pairs, the largest among long-context benchmarks, and open-source part of it. Empirical evaluation of 13 mainstream LLMs reveals systematic bottlenecks in set-based comparison and implicit relational reasoning, highlighting critical gaps in current long-context reasoning capabilities.
📝 Abstract
The increasing context length of modern language models has created a need to evaluate their ability to retrieve and process information across extensive documents. While existing benchmarks test long-context capabilities, they often lack a structured way to systematically vary question complexity. We introduce KG-QAGen (Knowledge-Graph-based Question-Answer Generation), a framework that leverages structured representations of financial agreements to extract QA pairs at multiple complexity levels along three key dimensions -- multi-hop retrieval, set operations, and answer plurality -- enabling fine-grained assessment of model performance across controlled difficulty levels. Using this framework, we construct a dataset of 20,139 QA pairs (the largest among long-context benchmarks) and open-source a part of it. We evaluate 13 proprietary and open-source LLMs and observe that even the best-performing models struggle with set-based comparisons and multi-hop logical inference. Our analysis reveals systematic failure modes tied to semantic misinterpretation and the inability to handle implicit relations.
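The three complexity dimensions can be illustrated with a minimal sketch. This is not the paper's implementation: the toy triples, relation names, and helper functions below are hypothetical, chosen only to show how hop count, set operations, and answer plurality give independent knobs for question difficulty.

```python
# Illustrative sketch (hypothetical schema, not KG-QAGen's actual code):
# answers derived from a toy knowledge graph of financial agreements,
# with difficulty controlled by hop count and set operations.

# Toy KG: (subject, relation, object) triples.
triples = [
    ("AgreementA", "has_party", "BankX"),
    ("AgreementA", "has_party", "BankY"),
    ("AgreementB", "has_party", "BankY"),
    ("BankX", "acts_as", "Administrative Agent"),
    ("BankY", "acts_as", "Lender"),
]

def neighbors(entity, relation):
    """Entities reachable from `entity` via `relation` in one hop."""
    return {o for s, r, o in triples if s == entity and r == relation}

def multi_hop(entity, relations):
    """Follow a chain of relations; chain length controls difficulty."""
    frontier = {entity}
    for rel in relations:
        frontier = set().union(*(neighbors(e, rel) for e in frontier))
    return frontier

# Answer plurality: a single-hop question with multiple gold answers.
# "Who are the parties to Agreement A?"
parties_a = multi_hop("AgreementA", ["has_party"])  # {"BankX", "BankY"}

# Multi-hop retrieval: chain two relations.
# "What roles do the parties to Agreement A act as?"
roles_a = multi_hop("AgreementA", ["has_party", "acts_as"])

# Set operations: intersect answer sets across documents.
# "Which parties appear in both Agreement A and Agreement B?"
common = parties_a & multi_hop("AgreementB", ["has_party"])  # {"BankY"}
```

Because each axis is varied independently (longer relation chains, added set operators, more gold answers), a generator built this way can emit questions at graded difficulty levels with answers that are verifiable against the graph by construction.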