The Coverage Principle: A Framework for Understanding Compositional Generalization

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models excel at pattern matching but exhibit limited systematic compositional generalization, particularly in multi-hop reasoning, due to path ambiguity and insufficient training coverage. This work introduces the *coverage principle*, showing that generalization capability is fundamentally constrained by how densely the training data covers substitutable contextual fragments. We propose a data-centric theoretical framework for compositional generalization, featuring a mechanism-based taxonomy of three generalization types: structure-based, property-based, and shared-operator. Through empirical analysis of Transformers, algebraic-invariance modeling, and complexity-theoretic derivation, we identify path ambiguity as a root cause of distorted state representations. The theory predicts that data requirements for two-hop generalization scale at least quadratically with vocabulary size, that parameter scaling does not improve data efficiency, and that chain-of-thought supervision improves efficiency but cannot resolve path ambiguity. Experimental results closely match these theoretical predictions.
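The quadratic-coverage claim can be made concrete with a toy two-hop task. The sketch below is illustrative only (the function names, lookup tables, and sampling scheme are ours, not the paper's): it composes two random primitives into t = f2(f1(x1, x2), x3) and counts how many first-hop fragments a fixed-size training sample actually covers.

```python
import itertools
import random

V = 8  # vocabulary (token set) size, kept small for illustration
tokens = range(V)
random.seed(0)

# Random lookup tables standing in for the two composed primitives.
f1 = {(a, b): random.randrange(V) for a in tokens for b in tokens}
f2 = {(h, c): random.randrange(V) for h in tokens for c in tokens}

def two_hop(x1, x2, x3):
    """Compose the primitives; the intermediate value h is never shown."""
    return f2[(f1[(x1, x2)], x3)]

# There are V**3 distinct inputs; sample a training set of size 2*V**2,
# the order of growth the paper argues is necessary for generalization.
train = random.sample(list(itertools.product(tokens, repeat=3)), k=2 * V * V)

# Coverage here: how many distinct first-hop fragments (x1, x2) the
# training set substitutes into some context.
covered = {(x1, x2) for x1, x2, _ in train}
print(f"inputs: {V**3}, train size: {len(train)}, "
      f"first-hop fragments covered: {len(covered)}/{V * V}")
```

Rerunning with larger V shows why coverage is the bottleneck: the input space grows as V³ while a training budget proportional to V² can at best touch each first-hop fragment a constant number of times.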

📝 Abstract
Large language models excel at pattern matching, yet often fall short in systematic compositional generalization. We propose the coverage principle: a data-centric framework showing that models relying primarily on pattern matching for compositional tasks cannot reliably generalize beyond substituting fragments that yield identical results when used in the same contexts. We demonstrate that this framework has strong predictive power for the generalization capabilities of Transformers. First, we derive and empirically confirm that the training data required for two-hop generalization grows at least quadratically with the token set size, and that training data efficiency does not improve with 20x parameter scaling. Second, for compositional tasks with path ambiguity, where one variable affects the output through multiple computational paths, we show that Transformers learn context-dependent state representations that undermine both performance and interpretability. Third, Chain-of-Thought supervision improves training data efficiency for multi-hop tasks but still struggles with path ambiguity. Finally, we outline a *mechanism-based* taxonomy that distinguishes three ways neural networks can generalize: structure-based (bounded by coverage), property-based (leveraging algebraic invariances), and shared-operator (through function reuse). This conceptual lens contextualizes our results and highlights where new architectural ideas are needed to achieve systematic compositionality. Overall, the coverage principle provides a unified lens for understanding compositional reasoning, and underscores the need for fundamental architectural or training innovations to achieve truly systematic compositionality.
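The path-ambiguity failure mode described above can be sketched with a minimal example. The modular-arithmetic tables below are ours and purely illustrative (not the paper's setup): the variable x2 reaches the output both through the intermediate state h and directly in the second hop, so h alone no longer determines the answer.

```python
V = 6  # token set size, kept small for illustration

def f1(a, b):
    """First hop: produces the intermediate state h."""
    return (a + b) % V

def f2(h, b):
    """Second hop: also reads b, creating the second path."""
    return (2 * h + b) % V

def ambiguous(x1, x2):
    h = f1(x1, x2)    # path 1: x2 shapes the intermediate state
    return f2(h, x2)  # path 2: x2 also conditions the final step

# Inputs that share the same intermediate state h can still disagree on
# the output, so a model cannot represent h independently of its context:
# these are the context-dependent state representations the abstract
# says undermine performance and interpretability.
clashes = [(a, b, c, d)
           for a in range(V) for b in range(V)
           for c in range(V) for d in range(V)
           if f1(a, b) == f1(c, d) and ambiguous(a, b) != ambiguous(c, d)]
print(f"input pairs sharing h but disagreeing on output: {len(clashes)}")
```

For instance, (x1, x2) = (0, 1) and (1, 0) both yield h = 1, yet their outputs differ (3 vs. 2), so no context-free encoding of h can be correct for both.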
Problem

Research questions and friction points this paper is trying to address.

Models rely on pattern matching and fail to generalize compositionally in a systematic way
Training data requirements grow at least quadratically for two-hop generalization, with no efficiency gains from parameter scaling
Transformers struggle with path ambiguity, learning context-dependent state representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes the coverage principle as a data-centric framework for compositional generalization
Shows Transformers' training data requirements grow at least quadratically with token set size
Shows Chain-of-Thought supervision improves multi-hop data efficiency but does not resolve path ambiguity