🤖 AI Summary
This paper investigates the fundamental robustness limits of language generation in the limit under contaminated data, i.e., enumerations perturbed by arbitrary insertions of noise strings and omissions of genuine examples. It asks whether accurate generation of new, unseen strings from the target language remains feasible, including under the stronger requirement of dense generation, when the fraction of contaminated examples tends to zero.
Method: We derive theoretical robustness bounds for generation from contaminated enumerations, combining the generation-in-the-limit framework, membership-query oracles, formal dense-generation criteria, and a beyond-worst-case model inspired by curriculum learning.
Contribution/Results: We prove that language generation in the limit is achievable for all countable collections if and only if the fraction of contaminated examples converges to zero. Dense generation is strictly less robust to contamination in general, but it remains achievable in the curriculum-inspired model even with infinitely many contaminated examples. This work provides the first systematic characterization of generability and its robustness for language learning in noisy environments.
📝 Abstract
We study language generation in the limit, where an algorithm observes an adversarial enumeration of strings from an unknown target language $K$ and must eventually generate new, unseen strings from $K$. Kleinberg and Mullainathan [KM24] proved that generation is achievable in surprisingly general settings. But their generator suffers from "mode collapse," producing from an ever-smaller subset of the target. To address this, Kleinberg and Wei [KW25] require the generator's output to be "dense" in the target language. They showed that generation with density, surprisingly, remains achievable at the same generality. Both results assume perfect data: no noisy insertions and no omissions. This raises a central question: how much contamination can generation tolerate? Recent works made partial progress on this question by studying (non-dense) generation with either finite amounts of noise (but no omissions) or omissions (but no noise). We characterize robustness under contaminated enumerations:

1. Generation under Contamination: Language generation in the limit is achievable for all countable collections iff the fraction of contaminated examples converges to zero. When this fails, we characterize which collections are generable.
2. Dense Generation under Contamination: Dense generation is strictly less robust to contamination than generation. As a byproduct, we resolve an open question of Raman and Raman [ICML25] by showing that generation is possible with only membership oracle access under finitely many contaminated examples.

Finally, we introduce a beyond-worst-case model inspired by curriculum learning and prove that dense generation is achievable even with infinite contamination provided the fraction of contaminated examples converges to zero. This suggests curriculum learning may be crucial for learning from noisy web data.