🤖 AI Summary
This paper investigates the progressive generation and recognition of unknown formal languages under incremental and partial enumeration. Specifically, it addresses the setting where a language is revealed only through an infinite subset of density α. The authors propose a unified analytical framework grounded in density-based metrics and topological modeling, integrating formal language theory, computability, and limit learning. Key contributions are: (1) resolving a long-standing open problem by establishing the tight bound of 1/2 on the best achievable lower density of generation, which generalizes to α/2 when only a subset of density α is revealed under partial enumeration; (2) providing a topological characterization of language identification, proving that Angluin's learnability condition is equivalent to an associated topological space having the $T_D$ separation property; and (3) deriving exact density thresholds and necessary and sufficient conditions for both generation and identification, completing the picture for this line of work.
📝 Abstract
The success of large language models (LLMs) has motivated formal theories of language generation and learning. We study the framework of \emph{language generation in the limit}, where an adversary enumerates strings from an unknown language $K$ drawn from a countable class, and an algorithm must generate unseen strings from $K$. Prior work showed that generation is always possible, and that some algorithms achieve positive lower density, revealing a \emph{validity--breadth} trade-off between correctness and coverage. We resolve a main open question in this line, proving a tight bound of $1/2$ on the best achievable lower density. We then strengthen the model to allow \emph{partial enumeration}, where the adversary reveals only an infinite subset $C \subseteq K$. We show that generation in the limit remains achievable, and if $C$ has lower density $\alpha$ in $K$, the algorithm's output achieves density at least $\alpha/2$, matching the upper bound. This generalizes the $1/2$ bound to the partial-information setting, where the generator must recover within a factor $1/2$ of the revealed subset's density. We further revisit the classical Gold--Angluin model of \emph{language identification} under partial enumeration. We characterize when identification in the limit is possible -- when hypotheses $M_t$ eventually satisfy $C \subseteq M_t \subseteq K$ -- and in the process give a new topological formulation of Angluin's characterization, showing that her condition is precisely equivalent to an appropriate topological space having the $T_D$ separation property.
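To make the density notion concrete: the lower density of a set $S$ inside an enumeration $x_1, x_2, \ldots$ of $K$ is $\liminf_n |S \cap \{x_1, \ldots, x_n\}| / n$. The following toy sketch (our own illustration, not code from the paper) computes the prefix ratios for a simple example where $S$ is the even numbers inside the naturals, whose density is $1/2$:

```python
# Toy illustration of lower density (not from the paper): for an
# enumeration x_1, x_2, ... of K and a set S, compute the prefix
# ratios |S ∩ {x_1..x_t}| / t, whose liminf is the lower density.
def prefix_densities(enumeration, S):
    """Return the ratio of elements of S among the first t enumerated
    strings, for every prefix length t."""
    count = 0
    ratios = []
    for t, x in enumerate(enumeration, start=1):
        if x in S:
            count += 1
        ratios.append(count / t)
    return ratios

# Example: K = {1, ..., 1000} enumerated in order, S = even numbers.
K_prefix = list(range(1, 1001))
S = {x for x in K_prefix if x % 2 == 0}
ratios = prefix_densities(K_prefix, S)
# The ratios settle near 1/2, the density of the evens.
```

A generator meeting the paper's $1/2$ bound would, in this language, produce an output set whose prefix ratios within $K$ stay bounded below by roughly $1/2$ in the limit.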