🤖 AI Summary
Large language models (LLMs) exhibit systematic deficits in compositional generalization for abstract reasoning tasks—particularly algebraic evaluation under out-of-distribution operator precedence rules (e.g., addition before multiplication). To address this, we propose iterative in-context learning: dynamically selecting and refining a small set of demonstration examples, augmented with explicit step-by-step reasoning instructions and iterative prompting. Crucially, we find that using *simpler* (rather than distribution-matched) examples—those with fewer operators or shallower nesting than test instances—significantly improves zero-shot generalization, challenging conventional wisdom on example selection. Empirical evaluation across multiple nonstandard precedence benchmarks demonstrates substantial gains over strong baselines: up to +27.4% absolute accuracy improvement in addition-first arithmetic. Our approach offers a novel pathway toward enhancing LLMs’ structured reasoning and compositional generalization capabilities without architectural modification or fine-tuning.
📝 Abstract
LLMs face significant challenges in systematic generalization, particularly on reasoning tasks that require compositional rules and the handling of out-of-distribution examples. To address these challenges, we introduce an in-context learning methodology that improves the generalization capabilities of general-purpose LLMs. Our approach employs an iterative example selection strategy that incrementally constructs a tailored set of few-shot examples optimized to enhance the model's performance on a given task. As a proof of concept, we apply this methodology to the evaluation of algebraic expressions under non-standard simplification rules, in which the precedence of addition and multiplication is swapped.
Our findings indicate that LLMs exhibit limited proficiency in these mathematical tasks. We further demonstrate that LLM reasoning benefits from our iterative shot-selection prompting strategy when integrated with explicit reasoning instructions. Crucially, our experiments reveal that some LLMs generalize better when prompted with simpler few-shot examples than with complex ones drawn from the test data distribution.
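To make the non-standard precedence rule concrete, here is a minimal sketch (not the authors' evaluation harness) of an addition-first evaluator, assuming parenthesis-free expressions over non-negative integers with only `+` and `*`:

```python
def eval_addition_first(expr: str) -> int:
    """Evaluate an expression where '+' binds tighter than '*'.

    Under this rule, "2+3*4" parses as (2+3)*4 = 20, whereas
    standard precedence would give 2+(3*4) = 14.
    """
    # Because '*' is now the lowest-precedence operator, split on it
    # first at the top level, then evaluate each '+'-chain as a sum.
    product = 1
    for factor in expr.replace(" ", "").split("*"):
        product *= sum(int(term) for term in factor.split("+"))
    return product
```

For example, `eval_addition_first("2+3*4")` returns 20, the kind of out-of-distribution target the benchmarks test against.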