🤖 AI Summary
This paper addresses the "few-shot training, large-scale compositional generalization" problem in task generalization: can a model trained on only a small number of base tasks generalize to a task family of size $d^T$, generated by $T$-step autoregressive composition (ARC) over $d$ subtasks? We propose the ARC modeling framework and provide the first theoretical guarantee, showing that $\tilde{O}(d)$ task examples suffice for complete generalization. Empirically, we demonstrate that Transformers achieve exponential task generalization on sparse parity benchmarks via in-context learning (ICL) and chain-of-thought (CoT) reasoning, and successfully transfer this capability to arithmetic operations (addition, subtraction, multiplication, division) and multilingual translation—validating ARC's cross-domain efficacy across logical, numerical, and semantic tasks. Our core contribution is the first theoretical model supporting exponential task generalization, coupled with evidence that large language models implicitly learn ARC structure during training.
📝 Abstract
Large language models (LLMs) exhibit remarkable task generalization, solving tasks they were never explicitly trained on with only a few demonstrations. This raises a fundamental question: When can learning from a small set of tasks generalize to a large task family? In this paper, we investigate task generalization through the lens of AutoRegressive Compositional (ARC) structure, where each task is a composition of $T$ operations, and each operation is among a finite family of $d$ subtasks. This yields a total class of size $d^T$. We first show that generalization to all $d^T$ tasks is theoretically achievable by training on only $\tilde{O}(d)$ tasks. Empirically, we demonstrate that Transformers achieve such exponential task generalization on sparse parity functions via in-context learning (ICL) and Chain-of-Thought (CoT) reasoning. We further demonstrate this generalization in arithmetic and language translation, extending beyond parity functions.
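The combinatorics behind the ARC structure can be made concrete with a small sketch. This is not the paper's construction, only a hypothetical toy instantiation: each subtask is a simple Boolean operation on a short bit-vector state, a task is a $T$-step composition of subtasks, and the intermediate states play the role of chain-of-thought steps. With $d$ subtasks and depth $T$, the task family has $d^T$ members.

```python
import itertools

# Hypothetical toy ARC instantiation: d = 3 subtasks acting on a 3-bit state.
SUBTASKS = {
    "xor01": lambda s: (s[0] ^ s[1], s[1], s[2]),
    "xor12": lambda s: (s[0], s[1] ^ s[2], s[2]),
    "flip0": lambda s: (1 - s[0], s[1], s[2]),
}
T = 4  # composition depth

def run_task(op_names, state):
    """Apply the chosen subtasks autoregressively, recording every
    intermediate state (the chain-of-thought trace)."""
    trace = [state]
    for name in op_names:
        state = SUBTASKS[name](state)
        trace.append(state)
    return trace

# Enumerate the full task family: all length-T sequences of subtask names.
family = list(itertools.product(SUBTASKS, repeat=T))
print(len(family))          # d**T = 3**4 = 81 distinct composite tasks
print(run_task(family[0], (1, 0, 1)))  # one task's CoT trace: T+1 states
```

The point of the sketch is only the size gap: the model sees a number of tasks scaling with $d$ (here 3), while the family it must generalize to scales as $d^T$ (here 81), which grows exponentially in the composition depth.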