🤖 AI Summary
Existing long-context benchmarks predominantly focus on few-shot classification, lacking systematic evaluation of models’ ability to induce implicit functional patterns from hundreds to thousands of examples and generalize to novel instances.
Method: We introduce MIR-Bench, the first long-context benchmark for multi-example contextual inductive reasoning. It generates interpretable, difficulty-stratified, multi-format sequences via controllable synthetic functions; formally defines evaluation protocols for hundred- and thousand-example induction; and integrates programmatic verification, chain-of-thought (CoT) prompting, and noise-robust assessment.
Contribution/Results: Our systematic evaluation reveals that state-of-the-art long-context models underperform significantly relative to their short-context counterparts on such tasks. CoT yields gains only at moderate example scales, not at scale extremes. We identify critical performance bottlenecks—including context-length-dependent degradation in pattern abstraction and generalization—and delineate concrete optimization pathways for future model and prompt engineering.
📝 Abstract
Inductive Reasoning (IR), the ability to summarize rules from examples and apply them to new ones, has long been viewed as a core ability for general intelligence and has been widely studied by cognitive science and AI researchers. Many benchmarks have been proposed to measure this ability for Large Language Models (LLMs); however, they focus on few-shot (usually $<$10) settings and lack evaluation of aggregating many pieces of information from long contexts. On the other hand, the ever-growing context length of LLMs has brought forth the novel paradigm of many-shot In-Context Learning (ICL), which addresses new tasks with hundreds to thousands of examples without expensive and inefficient fine-tuning. However, many-shot evaluations mostly focus on classification (a very limited aspect of IR), and popular long-context LLM tasks such as Needle-In-A-Haystack (NIAH) seldom require complicated intelligence for integrating many pieces of information. To address the issues from both worlds, we propose MIR-Bench, the first many-shot in-context inductive reasoning benchmark, which asks an LLM to induce outputs via input-output examples from underlying functions with diverse data formats. Based on MIR-Bench, we study many novel problems for inductive reasoning and many-shot ICL, including robustness against erroneous shots and the effect of Chain-of-Thought (CoT), and obtain insightful findings.
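The task format described above can be sketched in a few lines: a hidden synthetic function generates many input-output shots, the shots are serialized into a long prompt, and the model's answer on a held-out query is checked programmatically. This is a minimal illustrative sketch; the function, prompt layout, and helper names here are assumptions for exposition, not MIR-Bench's actual pipeline.

```python
import random

# Hypothetical hidden function the model must induce from examples;
# MIR-Bench draws from its own pool of synthetic functions.
def hidden_fn(x):
    return [v * 2 for v in reversed(x)]

def build_manyshot_prompt(fn, n_shots, query, seed=0):
    """Serialize many input-output shots of `fn` into a single prompt.

    Layout (illustrative, not the paper's exact format):
    repeated "Input: ... -> Output: ..." lines, then the query input.
    """
    rng = random.Random(seed)
    lines = []
    for _ in range(n_shots):
        x = [rng.randint(0, 9) for _ in range(rng.randint(2, 5))]
        lines.append(f"Input: {x} -> Output: {fn(x)}")
    lines.append(f"Input: {query} -> Output:")
    return "\n".join(lines)

def verify(fn, query, model_answer):
    """Programmatic check: compare the model's answer to ground truth."""
    return model_answer == str(fn(query))

# A hundred-shot prompt ending with the held-out query.
prompt = build_manyshot_prompt(hidden_fn, n_shots=500, query=[1, 2, 3])
```

Because the ground truth comes from an executable function, verification is exact string/value comparison rather than human grading, which is what makes difficulty-stratified, noise-injected variants cheap to generate at scale.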