🤖 AI Summary
This study empirically investigates the learnability of probabilistic regular languages by recurrent neural network (RNN) and Transformer language models, focusing on how model architecture, hidden state size, and distributional complexity affect learning performance. Methodologically, the authors introduce two learnability predictors: the rank of a regular language model (RLM), defined as the dimension of the linear space spanned by the logits of its conditional distributions, and the expected length of sampled strings. Together, these capture the complexity of the model's internal representations and the structural difficulty of the language. The authors develop a unified framework for generating and evaluating RLMs and use multivariate regression to analyze the predictive power of each complexity parameter across architectures. Results show that RLM rank and expected string length are strong, significant predictors of learnability for both RNNs and Transformers; several further predictors also reach significance, but with architecture-specific patterns. The findings point to both shared representational capacities and differing inductive biases across neural LM architectures, and offer a formal, quantitative framework for analyzing what neural language models can learn.
📝 Abstract
What can large language models learn? By definition, language models (LMs) are distributions over strings. Therefore, an intuitive way of addressing the above question is to formalize it as a matter of learnability of classes of distributions over strings. While prior work in this direction has focused on assessing theoretical limits, we seek to understand empirical learnability. Unlike prior empirical work, we evaluate neural LMs on their home turf, learning probabilistic languages, rather than as classifiers of formal languages. In particular, we investigate the learnability of regular LMs (RLMs) by RNN and Transformer LMs. We empirically test the learnability of RLMs as a function of various complexity parameters of the RLM and the hidden state size of the neural LM. We find that the RLM rank, which corresponds to the dimension of the linear space spanned by the logits of its conditional distributions, and the expected length of sampled strings are strong and significant predictors of learnability for both RNNs and Transformers. Several other predictors also reach significance, but with patterns that differ between RNNs and Transformers.
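To make the central predictor concrete: the RLM rank is the dimension of the linear space spanned by the logit vectors of the RLM's conditional next-symbol distributions, one vector per state. A minimal sketch of this computation, using a hypothetical 3-state RLM over a 3-symbol alphabet (the specific logit values are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical 3-state RLM over the alphabet {a, b, EOS}: each row holds
# the logits of the conditional next-symbol distribution in one state.
logits = np.array([
    [2.0, 0.5, -1.0],   # state 0
    [4.0, 1.0, -2.0],   # state 1: a scalar multiple of state 0
    [0.0, 1.5,  0.5],   # state 2: linearly independent of state 0
])

# RLM rank = dimension of the linear space spanned by the logit vectors,
# i.e. the matrix rank of the state-by-symbol logit matrix.
rlm_rank = int(np.linalg.matrix_rank(logits))
print(rlm_rank)  # 2, since state 1's logits are 2x state 0's
```

The intuition is that a low-rank RLM's conditional distributions all live in a small linear subspace, so a neural LM with a comparably sized hidden state can in principle reproduce them; the paper's regressions test how well this quantity predicts learnability in practice.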