Hardness of Learning Regular Languages in the Next Symbol Prediction Setting

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper investigates the PAC learnability of regular languages under the next-symbol prediction (NSP) setting, in which the learner receives only positive examples and, for each prefix, observes two labels: whether the prefix itself belongs to the language, and which next symbols extend it toward a valid string. To enable rigorous theoretical analysis, the authors formalize the NSP framework and establish a reduction from standard PAC learning to it. Their main contribution is the first proof that deterministic finite automata (DFAs) are not efficiently PAC learnable even in this information-rich NSP setting, assuming standard cryptographic hardness assumptions. The key technical insight is a construction of "label-pseudorandom" instances in which the additional NSP labels yield negligible information gain. This result exposes a fundamental theoretical limitation on modeling regular languages with neural sequence models, thereby bridging the gap between empirical observations and computational learning theory.

📝 Abstract
We study the learnability of languages in the Next Symbol Prediction (NSP) setting, where a learner receives only positive examples from a language together with, for every prefix, (i) whether the prefix itself is in the language and (ii) which next symbols can lead to an accepting string. This setting has been used in prior works to empirically analyze neural sequence models, and additionally, we observe that efficient algorithms for the NSP setting can be used to learn the (truncated) support of language models. We formalize the setting so as to make it amenable to PAC-learning analysis. While the setting provides a much richer set of labels than the conventional classification setting, we show that learning concept classes such as DFAs and Boolean formulas remains computationally hard. The proof is via a construction that makes almost all additional labels uninformative, yielding a reduction from the conventional learning problem to learning with NSP labels. Under cryptographic assumptions, the reduction implies that the problem of learning DFAs is computationally hard in the NSP setting.
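To make the NSP labels concrete, here is a minimal sketch of how they can be generated from a DFA. The automaton below (accepting the hypothetical example language a*b* over the alphabet {a, b}) and all function names are illustrative choices, not taken from the paper: for each prefix of a positive example, the sketch emits (i) a membership bit and (ii) the set of next symbols from which an accepting string remains reachable.

```python
# Hypothetical DFA for a*b*: a block of a's followed by a block of b's.
# "dead" is a rejecting sink reached by reading an 'a' after any 'b'.
ALPHABET = ["a", "b"]
DELTA = {("q0", "a"): "q0", ("q0", "b"): "q1",
         ("q1", "a"): "dead", ("q1", "b"): "q1",
         ("dead", "a"): "dead", ("dead", "b"): "dead"}
START, ACCEPT = "q0", {"q0", "q1"}

def can_accept(state):
    """Return True if some accepting state is reachable from `state`."""
    seen, frontier = {state}, [state]
    while frontier:
        s = frontier.pop()
        if s in ACCEPT:
            return True
        for sym in ALPHABET:
            t = DELTA[(s, sym)]
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return False

def nsp_labels(word):
    """For each prefix of `word`, return (prefix, membership, valid next symbols)."""
    labels, state = [], START
    for i in range(len(word) + 1):
        member = state in ACCEPT  # label (i): is this prefix in the language?
        # label (ii): symbols whose successor state can still reach acceptance
        nxt = sorted(sym for sym in ALPHABET
                     if can_accept(DELTA[(state, sym)]))
        labels.append((word[:i], member, nxt))
        if i < len(word):
            state = DELTA[(state, word[i])]
    return labels
```

For the positive example "aab", every prefix is itself in a*b*, but once a 'b' has been read only 'b' remains a valid continuation; this is the extra per-prefix information the NSP setting provides, and the paper's construction shows how such labels can be made uninformative almost everywhere.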
Problem

Research questions and friction points this paper is trying to address.

Studying learnability of languages with next symbol prediction
Analyzing computational hardness of learning DFAs and formulas
Establishing reduction from conventional learning to NSP setting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Next Symbol Prediction setting for language learning
Reduction from conventional learning to NSP labels
Proving DFA learning remains computationally hard