🤖 AI Summary
Existing long-context benchmarks predominantly focus on few-shot classification, lacking systematic evaluation of models’ ability to induce implicit functional patterns from hundreds to thousands of examples and generalize to novel instances.
Method: We introduce MIR-Bench, the first long-context benchmark for multi-example contextual inductive reasoning. It generates interpretable, difficulty-stratified, multi-format sequences via controllable synthetic functions; formally defines evaluation protocols for hundred- and thousand-example induction; and integrates programmatic verification, chain-of-thought (CoT) prompting, and noise-robust assessment.
Contribution/Results: Our systematic evaluation reveals that state-of-the-art long-context models underperform significantly relative to their short-context counterparts on such tasks. CoT yields gains only at moderate example scales, not at scale extremes. We identify critical performance bottlenecks—including context-length-dependent degradation in pattern abstraction and generalization—and delineate concrete optimization pathways for future model and prompt engineering.
📝 Abstract
Inductive Reasoning (IR), the ability to summarize rules from examples and apply them to new ones, has long been viewed as a core ability for general intelligence and has been widely studied by cognitive science and AI researchers. Many benchmarks have been proposed to measure this ability for Large Language Models (LLMs); however, they focus on few-shot (usually $<$10) settings and lack evaluation of aggregating many pieces of information from long contexts. On the other hand, the ever-growing context length of LLMs has brought forth the novel paradigm of many-shot In-Context Learning (ICL), which addresses new tasks with hundreds to thousands of examples without expensive and inefficient fine-tuning. However, many-shot evaluations mostly focus on classification (a very limited aspect of IR), and popular long-context LLM tasks such as Needle-In-A-Haystack (NIAH) seldom require complicated intelligence for integrating many pieces of information. To address the issues from both worlds, we propose MIR-Bench, the first many-shot in-context inductive reasoning benchmark, which asks an LLM to induce outputs via input-output examples from underlying functions with diverse data formats. Based on MIR-Bench, we study many novel problems for inductive reasoning and many-shot ICL, including robustness against erroneous shots and the effect of Chain-of-Thought (CoT), and obtain insightful findings.
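The task format described above can be sketched in a few lines: a hidden synthetic function generates many input-output shots, the shots are serialized into a long prompt, and the model's answer on a held-out query is checked programmatically. This is a minimal illustrative sketch; the function, prompt layout, and helper names here are assumptions for exposition, not MIR-Bench's actual pipeline.

```python
import random

# Hypothetical hidden function the model must induce from examples;
# MIR-Bench draws from its own pool of synthetic functions.
def hidden_fn(x):
    return [v * 2 for v in reversed(x)]

def build_manyshot_prompt(fn, n_shots, query, seed=0):
    """Serialize many input-output shots of `fn` into a single prompt.

    Layout (illustrative, not the paper's exact format):
    repeated "Input: ... -> Output: ..." lines, then the query input.
    """
    rng = random.Random(seed)
    lines = []
    for _ in range(n_shots):
        x = [rng.randint(0, 9) for _ in range(rng.randint(2, 5))]
        lines.append(f"Input: {x} -> Output: {fn(x)}")
    lines.append(f"Input: {query} -> Output:")
    return "\n".join(lines)

def verify(fn, query, model_answer):
    """Programmatic check: compare the model's answer to ground truth."""
    return model_answer == str(fn(query))

# A hundred-shot prompt ending with the held-out query.
prompt = build_manyshot_prompt(hidden_fn, n_shots=500, query=[1, 2, 3])
```

Because the ground truth comes from an executable function, verification is exact string/value comparison rather than human grading, which is what makes difficulty-stratified, noise-injected variants cheap to generate at scale.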