Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities

📅 2025-11-04
🤖 AI Summary
Existing long-context evaluation benchmarks rely predominantly on local retrieval, which fails to rigorously assess models' capacity for global information aggregation and deep reasoning. Method: We introduce Oolong, a benchmark split into controllable synthetic tasks (Oolong-synth) and real-world conversational tasks (Oolong-real), featuring distributed tasks that require atomic-level chunk analysis, cross-segment classification, in-context counting, and temporal/relational reasoning. The methodology combines controllable synthetic data generation, multi-step in-context aggregative reasoning, and realistic conversational structure, substantially increasing both task difficulty and ecological validity. Contribution/Results: Experiments reveal that state-of-the-art models, including GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro, achieve less than 50% accuracy on both splits at 128K context, exposing critical deficiencies in long-context information integration and complex reasoning. Oolong establishes a more rigorous, discriminative standard for evaluating long-context capabilities in foundation models.

📝 Abstract
As model context lengths continue to grow, concerns about whether models effectively use the full context length have persisted. While several carefully designed long-context evaluations have recently been released, these evaluations tend to rely on retrieval from one or more sections of the context, which allows nearly all of the context tokens to be disregarded as noise. This represents only one type of task that might be performed with long context. We introduce Oolong, a benchmark of long-context reasoning tasks that require analyzing individual chunks of text on an atomic level, and then aggregating these analyses to answer distributional questions. Oolong is separated into two task sets: Oolong-synth, a set of naturalistic synthetic tasks, where we can easily ablate components of the reasoning problem; and Oolong-real, a downstream setting which requires reasoning over real-world conversational data. Oolong requires models to reason over large quantities of examples, to perform both classification and counting in-context, and to reason over temporal and user relations. Even frontier models struggle on Oolong, with GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro all achieving less than 50% accuracy on both splits at 128K. We release the data and evaluation harness for Oolong to enable further development of models that can reason over large quantities of text.
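The task structure described above (analyze each chunk atomically, then aggregate the per-chunk results to answer a distributional question) can be illustrated with a small sketch. This is a hypothetical toy example, not code from the Oolong benchmark or its evaluation harness; the `classify` keyword rule stands in for the atomic per-chunk analysis a model must perform in-context.

```python
from collections import Counter

def classify(chunk: str) -> str:
    """Stand-in for atomic per-chunk analysis (here: a toy keyword rule)."""
    return "question" if chunk.rstrip().endswith("?") else "statement"

def distributional_answer(chunks: list[str]) -> tuple[str, int]:
    """Aggregate atomic labels to answer: which label is most common, and how often?

    An Oolong-style task asks the model to do this entirely in-context,
    so no chunk can be discarded as noise.
    """
    counts = Counter(classify(c) for c in chunks)
    label, count = counts.most_common(1)[0]
    return label, count

chunks = [
    "Is it raining?",
    "It is sunny.",
    "The meeting moved.",
    "Who called?",
    "Done.",
]
print(distributional_answer(chunks))  # ('statement', 3)
```

The point of the sketch is that the answer depends on every chunk: unlike retrieval-style tasks, skipping any portion of the context changes the count and can flip the final answer.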
Problem

Research questions and friction points this paper is trying to address.

Evaluating long-context reasoning and aggregation capabilities in models
Assessing model performance on atomic-level text analysis and distributional questions
Testing reasoning over large text quantities with classification and counting tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates long-context reasoning via chunk-level analysis
Uses synthetic and real-world conversational data tasks
Measures aggregation of atomic analyses into distributional answers