On Provable Length and Compositional Generalization

📅 2024-02-07
🏛️ arXiv.org
📈 Citations: 7
Influential: 1
🤖 AI Summary
This work investigates two fundamental out-of-distribution generalization capabilities of sequence-to-sequence models: length generalization (to longer sequences) and compositional generalization (to unseen token combinations). We establish the first unified theoretical framework—grounded in function approximation theory and statistical learning theory—that provides provable generalization guarantees for deep sets, Transformers, state-space models, and RNNs under capacity constraints including weight norm, rank, and number of attention heads. Methodologically, we rigorously distinguish structured versus unstructured model restrictions and quantify the trade-off between training distribution diversity and model architecture. Our key result proves that, when the training distribution satisfies a precise diversity condition, all aforementioned capacity-constrained models achieve zero generalization error. This yields the first capacity-aware, architecture-specific generalization bounds for compositional and length extrapolation in modern sequence models.

📝 Abstract
Out-of-distribution generalization capabilities of sequence-to-sequence models can be studied through the lens of two crucial forms of generalization: length generalization -- the ability to generalize to longer sequences than those seen during training -- and compositional generalization -- the ability to generalize to token combinations not seen during training. In this work, we provide the first provable guarantees on length and compositional generalization for common sequence-to-sequence models -- deep sets, transformers, state space models, and recurrent neural nets -- trained to minimize the prediction error. We show that limited-capacity versions of these architectures achieve both length and compositional generalization provided the training distribution is sufficiently diverse. In the first part, we study structured limited-capacity variants of the different architectures and arrive at generalization guarantees with limited diversity requirements on the training distribution. In the second part, we study limited-capacity variants with weaker structural assumptions and arrive at generalization guarantees with stronger diversity requirements on the training distribution.
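To make the length-generalization claim concrete, here is a minimal illustrative sketch (not the paper's construction): a sum-pooled "deep set" predictor with a realizable target is fit only on short sequences and then evaluated on sequences ten times longer. The toy task (predicting the sum of tokens) and all names are hypothetical; the point is that a limited-capacity pooled model that fits the training lengths extrapolates in length.

```python
import numpy as np

rng = np.random.default_rng(0)

def pooled_features(seq):
    # Deep-set encoder: permutation-invariant sum pooling plus a bias feature.
    return np.array([seq.sum(), 1.0])

# Train only on short sequences (length 2-5); the target, the sum of the
# tokens, is exactly representable by the sum-pooled linear model.
train = [rng.uniform(-1, 1, size=rng.integers(2, 6)) for _ in range(200)]
X = np.stack([pooled_features(s) for s in train])
y = np.array([s.sum() for s in train])
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Evaluate on sequences ten times longer than any seen in training.
test = [rng.uniform(-1, 1, size=50) for _ in range(100)]
errs = [abs(pooled_features(s) @ w - s.sum()) for s in test]
print(max(errs))  # near zero despite the 10x longer inputs
```

The extrapolation here hinges on the target being realizable by the restricted architecture, which mirrors the role of the capacity constraints in the paper's guarantees.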
Problem

Research questions and friction points this paper is trying to address.

Studying length and compositional generalization in sequence-to-sequence models
Providing guarantees for different architectures with limited capacity
Investigating chain-of-thought supervision for length generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Limited capacity architectures enable generalization
Diverse training data ensures compositional generalization
Chain-of-thought supervision aids length generalization
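The compositional side of the claim can likewise be illustrated with a hypothetical toy sketch (again, not the paper's construction): a linear model over token-count features is trained on a diverse set of token pairs with one combination held out, and it still predicts the unseen combination correctly because the additive structure is identified from the remaining pairs.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = 6
true_val = rng.normal(size=vocab)  # hidden per-token contribution

def count_features(tokens):
    # Bag-of-tokens counts: a permutation-invariant, limited-capacity encoding.
    f = np.zeros(vocab)
    for t in tokens:
        f[t] += 1
    return f

# Train on all ordered pairs EXCEPT (4, 5): that combination is never seen.
train_pairs = [(a, b) for a in range(vocab) for b in range(vocab)
               if (a, b) != (4, 5)]
X = np.stack([count_features(p) for p in train_pairs])
y = np.array([true_val[a] + true_val[b] for a, b in train_pairs])
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# The held-out combination is predicted correctly: the diverse training
# pairs pin down every per-token contribution.
pred = count_features((4, 5)) @ w
print(abs(pred - (true_val[4] + true_val[5])))  # near zero
```

The diversity condition matters: if token 5 never appeared in any training pair, its contribution would be unidentifiable, echoing the trade-off the paper quantifies between training-distribution diversity and architectural restrictions.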