Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets

📅 2026-04-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current large language models are constrained by fixed context windows, which hinders efficient and consistent cross-document reasoning over large document collections. This work proposes SLIDERS, a framework that combines a relational database with a data reconciliation stage: it builds a queryable database via information extraction, reasons at scale through SQL queries, and uses provenance, extraction rationales, and metadata to detect and repair duplicated, inconsistent, and incomplete records. By moving beyond text concatenation and chunk-wise aggregation, SLIDERS outperforms all baselines on three existing long-context benchmarks, exceeding GPT-4.1 by 6.6 points on average. On two newly introduced benchmarks of 3.9 million and 36 million tokens, it improves over the next best baseline by approximately 19 and 32 points, respectively.
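The idea of answering questions with SQL over extracted records, rather than over concatenated text, can be illustrated with a minimal sketch. This is not the authors' implementation: the `revenue` table, its columns, the sample rows, and the helper names `build_db` and `revenue_growth` are all hypothetical, and the extraction step (which would be done by an LLM per chunk) is replaced by hard-coded rows.

```python
import sqlite3

# Hypothetical chunk-level extractions. In a real system an LLM would emit
# these rows per chunk; here they are hard-coded for illustration.
# Columns: company, year, revenue_musd, source_doc, chunk_id
EXTRACTED = [
    ("Acme", 2022, 120.0, "acme_10k_2022.pdf", 3),
    ("Acme", 2023, 150.0, "acme_10k_2023.pdf", 5),
    ("Globex", 2022, 80.0, "globex_ar_2022.pdf", 2),
    ("Globex", 2023, 95.0, "globex_ar_2023.pdf", 7),
]


def build_db(rows):
    """Load extracted rows into an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE revenue ("
        "company TEXT, year INTEGER, revenue_musd REAL, "
        "source_doc TEXT, chunk_id INTEGER)"
    )
    conn.executemany("INSERT INTO revenue VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def revenue_growth(conn, start_year, end_year):
    """Answer a cross-document question with one SQL self-join."""
    query = """
        SELECT a.company, b.revenue_musd - a.revenue_musd AS growth
        FROM revenue AS a
        JOIN revenue AS b ON a.company = b.company
        WHERE a.year = ? AND b.year = ?
        ORDER BY growth DESC
    """
    return conn.execute(query, (start_year, end_year)).fetchall()


conn = build_db(EXTRACTED)
print(revenue_growth(conn, 2022, 2023))
# -> [('Acme', 30.0), ('Globex', 15.0)]
```

The point of the sketch is that the join and sort run inside the database, so the cost of combining evidence does not depend on how much text fits in a model's context window.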

📝 Abstract

Real-world document question answering is challenging. Analysts must synthesize evidence across multiple documents and different parts of each document. However, any fixed LLM context window can be exceeded as document collections grow. A common workaround is to decompose documents into chunks and assemble answers from chunk-level outputs, but this introduces an aggregation bottleneck: as the number of chunks grows, systems must still combine and reason over an increasingly large body of extracted evidence. We present SLIDERS, a framework for question answering over long document collections through structured reasoning. SLIDERS extracts salient information into a relational database, enabling scalable reasoning over persistent structured state via SQL rather than concatenated text. To make this locally extracted representation globally coherent, SLIDERS introduces a data reconciliation stage that leverages provenance, extraction rationales, and metadata to detect and repair duplicated, inconsistent, and incomplete records. SLIDERS outperforms all baselines on three existing long-context benchmarks, despite all of them fitting within the context window of strong base LLMs, exceeding GPT-4.1 by 6.6 points on average. It also improves over the next best baseline by ~19 and ~32 points on two new benchmarks at 3.9M and 36M tokens, respectively.
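The reconciliation stage described above detects duplicated and inconsistent records using provenance and rationales. A rough sketch of the detection half, under stated assumptions, might look like the following. The `facts` table, its columns, the sample rows, and the functions `load`, `find_conflicts`, and `merge_duplicates` are hypothetical and are not taken from the paper; the repair step, which would need to weigh the provenance and rationale of each conflicting record, is left out.

```python
import sqlite3

# Hypothetical locally extracted records, each carrying provenance
# (source_doc) and an extraction rationale.
# Columns: entity, attribute, value, source_doc, rationale
ROWS = [
    ("Acme", "ceo", "J. Smith", "report_a.txt", "named in shareholder letter"),
    ("Acme", "ceo", "J. Smith", "report_b.txt", "listed in officer table"),
    ("Acme", "hq", "Austin", "report_a.txt", "address on cover page"),
    ("Acme", "hq", "Dallas", "report_c.txt", "mentioned in press release"),
]


def load(rows):
    """Store the extracted records in an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE facts ("
        "entity TEXT, attribute TEXT, value TEXT, "
        "source_doc TEXT, rationale TEXT)"
    )
    conn.executemany("INSERT INTO facts VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def find_conflicts(conn):
    """Return (entity, attribute) pairs with more than one distinct value."""
    query = """
        SELECT entity, attribute
        FROM facts
        GROUP BY entity, attribute
        HAVING COUNT(DISTINCT value) > 1
        ORDER BY entity, attribute
    """
    return conn.execute(query).fetchall()


def merge_duplicates(conn):
    """Collapse identical facts and count how many sources support each."""
    query = """
        SELECT entity, attribute, value, COUNT(*) AS n_sources
        FROM facts
        GROUP BY entity, attribute, value
        ORDER BY entity, attribute, value
    """
    return conn.execute(query).fetchall()


conn = load(ROWS)
print(find_conflicts(conn))    # -> [('Acme', 'hq')]
print(merge_duplicates(conn))
```

Here the two matching `ceo` records collapse into one fact supported by two sources, while the `hq` records are flagged as a conflict. Keeping `source_doc` and `rationale` on every row is what would let a later repair step decide which conflicting value to trust.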

Problem

Research questions and friction points this paper is trying to address.

long document question answering

context window limitation

evidence aggregation

scalable reasoning

document collections

Innovation

Methods, ideas, or system contributions that make the work stand out.

structured reasoning

relational database

data reconciliation