GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?

📅 2025-02-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing benchmarks cannot quantitatively evaluate large language models (LLMs) under arbitrarily long contexts and high reasoning complexity. Method: We introduce the first arithmetic reasoning benchmark with infinitely scalable difficulty and context length. It models reasoning structure as a computational graph and gives fine-grained, orthogonal control over both dimensions through structured, controllable noise injection and synthetic data generation. Contribution/Results: Through systematic performance attribution analysis, we find that reasoning performance declines along a sigmoid curve as complexity increases, and that exponential growth in inference computation yields only linear performance gains, which points to a fundamental reasoning bottleneck in current long-context LLMs. This work offers a quantifiable way to evaluate and investigate long-context reasoning capabilities.

📝 Abstract

Long-context large language models (LLMs) have recently shown strong performance in information retrieval and long-document QA. However, to tackle the most challenging intellectual problems, LLMs must reason effectively in long and complex contexts (e.g., frontier mathematical research). Studying how LLMs handle increasing reasoning complexity and context length is essential, yet existing benchmarks lack a solid basis for quantitative evaluation. Inspired by the abstraction of GSM-8K problems as computational graphs, and the ability to introduce noise by adding unnecessary nodes and edges, we develop a grade school math problem generator capable of producing arithmetic problems with infinite difficulty and context length under fine-grained control. Using our newly synthesized GSM-Infinite benchmark, we comprehensively evaluate existing LLMs. We find a consistent sigmoid decline in reasoning performance as complexity increases, along with a systematic inference scaling trend: exponentially increasing inference computation yields only linear performance gains. These findings underscore the fundamental limitations of current long-context LLMs and the key challenges in scaling reasoning capabilities. Our GSM-Infinite benchmark provides a scalable and controllable testbed for systematically studying and advancing LLM reasoning in long and complex contexts.
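The abstract's core idea, a problem as a computational graph whose depth sets the reasoning complexity and whose extra nodes add noise without changing the answer, can be illustrated with a minimal sketch. This is not the authors' generator: the function name `make_problem`, the node names (`v0`, `n0`, ...), the sentence templates, and the choice of operations are all hypothetical, chosen only to show how the two dimensions can be controlled independently.

```python
import random

def make_problem(num_ops, num_noise, seed=0):
    """Hypothetical sketch: build a chain-shaped computational graph.

    num_ops   -- number of dependent arithmetic steps (reasoning complexity)
    num_noise -- number of distractor nodes the answer never depends on
                 (they only lengthen the context)
    Returns (problem_text, answer).
    """
    rng = random.Random(seed)

    # Core chain: v0 -> v1 -> ... -> v{num_ops}
    values = {"v0": rng.randint(1, 9)}
    lines = [f"v0 is {values['v0']}."]
    for i in range(1, num_ops + 1):
        prev, name = f"v{i - 1}", f"v{i}"
        k = rng.randint(2, 5)
        if rng.random() < 0.5:
            values[name] = values[prev] + k
            lines.append(f"{name} is {k} more than {prev}.")
        else:
            values[name] = values[prev] * k
            lines.append(f"{name} is {k} times {prev}.")

    # Noise nodes: each hangs off a core node, but no core node reads them,
    # so the answer is unchanged while the text grows.
    core_names = list(values)
    for j in range(num_noise):
        parent = rng.choice(core_names)
        k = rng.randint(2, 5)
        lines.append(f"n{j} is {k} more than {parent}.")

    # Shuffle so the relevant statements are scattered through the context.
    rng.shuffle(lines)
    target = f"v{num_ops}"
    return " ".join(lines) + f" What is {target}?", values[target]
```

Calling `make_problem(5, 0, seed=1)` and `make_problem(5, 50, seed=1)` under these assumptions gives the same answer, but the second prompt is much longer, which is the kind of orthogonal control over complexity and context length the abstract describes.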

Problem

Research questions and friction points this paper is trying to address.

Evaluate LLMs at arbitrarily long context lengths

Assess how LLMs handle increasing reasoning complexity

Develop a scalable, controllable benchmark for LLM reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates grade school math problems with arbitrarily scalable difficulty and context length

Evaluates existing LLMs on long-context reasoning

Identifies a sigmoid decline in reasoning performance as complexity increases
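The two reported trends can be made concrete with a small sketch. The functional forms follow the abstract (a sigmoid decline with complexity, and linear gains from exponentially more inference compute, i.e. accuracy linear in the logarithm of compute). The function names and every parameter value below (`midpoint`, `steepness`, `a`, `b`) are hypothetical placeholders, not values fitted to the paper's results.

```python
import math

def sigmoid_accuracy(complexity, midpoint=10.0, steepness=0.5):
    """Accuracy that declines along a sigmoid as complexity grows.

    midpoint and steepness are illustrative assumptions, not fitted values.
    """
    return 1.0 / (1.0 + math.exp(steepness * (complexity - midpoint)))

def accuracy_vs_compute(compute, a=0.1, b=0.05):
    """Accuracy that is linear in log2(compute).

    Each doubling of compute adds the same increment b, so exponential
    growth in compute gives only linear gains. a and b are assumptions.
    """
    return a + b * math.log2(compute)

if __name__ == "__main__":
    for c in (2, 6, 10, 14, 18):
        print(f"complexity={c:2d}  accuracy={sigmoid_accuracy(c):.3f}")
    for n in (1, 2, 4, 8, 16):
        print(f"compute={n:2d}x  accuracy={accuracy_vs_compute(n):.3f}")
```

In this toy model accuracy stays near 1 at low complexity, crosses 0.5 at the midpoint, and then flattens toward 0, while each doubling of compute buys only a fixed additive improvement.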

🔎 Similar Papers

ReAttention: Training-Free Infinite Context with Finite Attention Scope