🤖 AI Summary
This work addresses a key limitation of existing results on language identification and generation, which typically rely on a strong realizability assumption: that the input data is drawn from a distribution supported on some language in a given collection, an assumption that breaks down in open-world settings. We study both problems in a fully agnostic setting, imposing no restrictions on the underlying data distribution, and propose objectives suited to this more general setup. Drawing upon statistical learning theory and information theory, we provide new theoretical characterizations of both language identification and generation, and derive nearly tight statistical rates. Our analysis substantially extends the settings in which existing approaches to both tasks apply.
📝 Abstract
Recent works on language identification and generation have established tight statistical rates at which these tasks can be achieved. These works typically operate under a strong realizability assumption: that the input data is drawn from an unknown distribution necessarily supported on some language in a given collection. In this work, we relax this realizability assumption entirely, and impose no restrictions on the distribution of the input data. We propose objectives to study both language identification and generation in this more general "agnostic" setup. Across both problems, we obtain new and interesting characterizations and nearly tight rates.
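To make the realizability assumption concrete, here is a minimal Python sketch of Gold-style identification by enumeration over a toy collection L_i = {0, 1, ..., i}. The collection, the streams, and the learner are illustrative assumptions for exposition only, not the paper's construction or its agnostic objective.

```python
# Toy sketch (illustrative assumptions, not the paper's construction):
# identification by enumeration over the collection L_i = {0, ..., i}.

def smallest_consistent(sample, bound=100):
    """Least index i <= bound with sample contained in L_i = {0, ..., i};
    None if no language in the collection is consistent with the sample."""
    for i in range(bound + 1):
        if all(0 <= x <= i for x in sample):
            return i
    return None

def run(stream):
    """Feed examples one by one and record the learner's guess after each."""
    seen, guesses = [], []
    for x in stream:
        seen.append(x)
        guesses.append(smallest_consistent(seen))
    return guesses

# Realizable stream: every example lies in L_5, so the guesses stabilize
# at index 5 and identification in the limit succeeds.
print(run([3, 1, 5, 2, 5, 0, 4]))  # [3, 3, 5, 5, 5, 5, 5]

# Agnostic stream: -1 belongs to no L_i, so no hypothesis in the
# collection is ever consistent and the realizable learner breaks down.
# This failure mode is what an agnostic objective must handle.
print(run([3, -1, 5, 2]))          # [3, None, None, None]
```

Under realizability the enumeration learner converges to a correct index after finitely many examples; once the data can fall outside every language in the collection, consistency is no longer a usable criterion, which is why a relaxed objective is needed in the agnostic setup.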