🤖 AI Summary
Research on the diachronic evolution of Korean has long been hindered by the absence of large-scale, openly licensed historical corpora. To address this gap, the authors present the Open Korean Historical Corpus, an openly licensed dataset spanning 13 centuries (7th century to 2025) and comprising 18 million documents and 5 billion tokens. It integrates 19 heterogeneous sources and covers multiple writing systems, including Classical Chinese, Idu, Hanja-Hangul mixed script, and modern Hangul; the texts have undergone systematic cleaning, orthographic normalization, and linguistic annotation (e.g., part-of-speech tagging, lemmatization). The resource enables quantitative, millennium-scale analysis of Korean language change and reveals three key phenomena: Idu usage peaked in the 1860s before declining sharply, the transition from Hanja to Hangul was a rapid shift starting around 1890, and the vocabularies of North and South Korea have diverged. Empirical evaluation further shows that modern Korean tokenizers produce out-of-vocabulary rates up to 51× higher on North Korean texts, which highlights gaps in modeling North Korean text and underscores the corpus's value as pre-training data for large language models.
📝 Abstract
The history of the Korean language is characterized by a discrepancy between its spoken and written forms and a pivotal shift from Chinese characters to the Hangul alphabet. However, this linguistic evolution has remained largely unexplored in NLP due to a lack of accessible historical corpora. To address this gap, we introduce the Open Korean Historical Corpus, a large-scale, openly licensed dataset spanning 1,300 years and 6 languages, as well as under-represented writing systems like Korean-style Sinitic (Idu) and Hanja-Hangul mixed script. This corpus contains 18 million documents and 5 billion tokens from 19 sources, ranging from the 7th century to 2025. We leverage this resource to quantitatively analyze major linguistic shifts: (1) Idu usage peaked in the 1860s before declining sharply; (2) the transition from Hanja to Hangul was a rapid transformation starting around 1890; and (3) North Korea's lexical divergence causes modern tokenizers to produce up to 51 times higher out-of-vocabulary rates. This work provides a foundational resource for quantitative diachronic analysis by capturing the history of the Korean language. Moreover, it can serve as a pre-training corpus for large language models, potentially improving their understanding of Sino-Korean vocabulary in modern Hangul as well as archaic writing systems.
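The abstract reports tokenizer quality as an out-of-vocabulary (OOV) rate and compares rates as a ratio ("up to 51 times higher"). The sketch below illustrates one common way to compute such a figure; it is not the authors' evaluation code. It assumes the OOV rate is the fraction of tokens absent from a fixed vocabulary, and the names `oov_rate` and `oov_ratio` and the toy data are hypothetical. A real subword tokenizer would more likely be measured by counting tokens mapped to its unknown-token ID.

```python
from typing import Iterable, Set


def oov_rate(tokens: Iterable[str], vocab: Set[str]) -> float:
    """Fraction of tokens that are not in the vocabulary (assumed definition)."""
    tokens = list(tokens)
    if not tokens:
        return 0.0  # an empty text has no OOV tokens by convention here
    unknown = sum(1 for tok in tokens if tok not in vocab)
    return unknown / len(tokens)


def oov_ratio(target_rate: float, baseline_rate: float) -> float:
    """How many times higher the target OOV rate is than the baseline."""
    if baseline_rate == 0:
        raise ValueError("baseline OOV rate is zero; ratio is undefined")
    return target_rate / baseline_rate


# Toy placeholder data, not drawn from the corpus.
vocab = {"a", "b", "c", "d"}
baseline_tokens = ["a", "b", "c", "x"]   # 1 of 4 tokens unknown
target_tokens = ["x", "y", "a", "z"]     # 3 of 4 tokens unknown

baseline = oov_rate(baseline_tokens, vocab)
target = oov_rate(target_tokens, vocab)
print(baseline, target, oov_ratio(target, baseline))
```

Reporting the ratio rather than the raw rates makes results comparable across tokenizers with different vocabulary sizes, but it becomes unstable when the baseline rate is close to zero, which is why the sketch rejects a zero baseline outright.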