🤖 AI Summary
Language models exhibit poor systematic generalization: for example, they fail on semantically equivalent reorderings of the input and on compositional reuse of known concepts in novel contexts. In addition, existing benchmarks lack interpretable, quantitative measures of task difficulty.
Method: This paper formalizes systematic generalization difficulty as the information entropy of the distribution of component parts in the training data, establishing an entropy-based framework for quantifying the difficulty of sequence-to-sequence tasks.
Contribution/Results: Cross-architecture experiments (including Transformers) show that model performance scales with the entropy of the training distribution: strong generalization emerges in high-entropy settings even without structural priors, while low-entropy tasks remain difficult and thus serve as demanding benchmarks for systematic generalization. These findings connect systematic generalization to information efficiency, yielding both a theoretically grounded difficulty metric and actionable guidance for model design and evaluation.
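As an illustration of the kind of measure described above (the paper's exact definition may differ), the Shannon entropy of the empirical component distribution in the training data would be written as:

```latex
H(C) = -\sum_{c \in C} p(c) \log_2 p(c)
```

where \(C\) is the set of component parts appearing in the training set and \(p(c)\) is the empirical frequency of component \(c\). Under this reading, a low-entropy training set concentrates probability mass on few components, so a model sees less varied evidence for how components combine.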
📝 Abstract
Systematic generalization remains challenging for current language models, which are known both to be sensitive to semantically similar permutations of the input and to struggle with known concepts presented in novel contexts. Although benchmarks exist for assessing compositional behavior, it is unclear how to measure the difficulty of a systematic generalization problem. In this work, we show how one aspect of systematic generalization can be described by the entropy of the distribution of component parts in the training data. We formalize a framework for measuring entropy in a sequence-to-sequence task and find that the performance of popular model architectures scales with the entropy. Our work connects systematic generalization to information efficiency, and our results indicate that success at high entropy can be achieved even without built-in priors, and that success at low entropy can serve as a target for assessing progress towards robust systematic generalization.
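To make the abstract's notion concrete, the following is a minimal sketch of computing the entropy of a component distribution over a toy sequence-to-sequence training set. The whitespace tokenization and the example data are hypothetical choices for illustration; the paper's actual definition of "component" may differ.

```python
from collections import Counter
import math

def component_entropy(training_examples):
    """Shannon entropy (bits) of the distribution of component parts
    across the source side of a seq2seq training set.

    Components are approximated here as whitespace-separated tokens;
    this is an assumption, not the paper's definition.
    """
    counts = Counter(
        tok
        for src, _tgt in training_examples
        for tok in src.split()
    )
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy SCAN-style data: four distinct components, each appearing once,
# gives the maximal entropy for four symbols: log2(4) = 2 bits.
data = [("jump twice", "JUMP JUMP"), ("walk left", "LTURN WALK")]
print(component_entropy(data))  # → 2.0
```

A skewed training set (e.g., one component repeated many times) would score lower, which under the paper's framing corresponds to a harder generalization target.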