Open Korean Historical Corpus: A Millennia-Scale Diachronic Collection of Public Domain Texts

📅 2025-10-28

🤖 AI Summary

Research on the diachronic evolution of Korean has long been hindered by the absence of large-scale, openly licensed historical corpora. To address this gap, the authors present an open Korean historical corpus spanning roughly 1,300 years (7th century to 2025), comprising 18 million documents and 5 billion tokens. It integrates 19 heterogeneous sources, covers multiple writing systems (Classical Chinese, Idu, Hanja-Hangul mixed script, and modern Hangul), and undergoes systematic cleaning, orthographic normalization, and linguistic annotation (e.g., part-of-speech tagging and lemmatization). The resource enables quantitative analysis of Korean language change over more than a millennium and reveals three key phenomena: Idu usage peaked in the 1860s before declining sharply, the transition from Hanja to Hangul accelerated rapidly from around 1890, and North and South Korean vocabulary followed divergent trajectories. Empirical evaluation further shows that modern Korean tokenizers produce out-of-vocabulary rates on North Korean texts up to 51 times higher than on standard benchmarks. This highlights a gap in modeling the North Korean variety and underscores the corpus's potential value as pre-training data for large language models.

📝 Abstract

The history of the Korean language is characterized by a discrepancy between its spoken and written forms and a pivotal shift from Chinese characters to the Hangul alphabet. However, this linguistic evolution has remained largely unexplored in NLP due to a lack of accessible historical corpora. To address this gap, we introduce the Open Korean Historical Corpus, a large-scale, openly licensed dataset spanning 1,300 years and 6 languages, as well as under-represented writing systems like Korean-style Sinitic (Idu) and Hanja-Hangul mixed script. This corpus contains 18 million documents and 5 billion tokens from 19 sources, ranging from the 7th century to 2025. We leverage this resource to quantitatively analyze major linguistic shifts: (1) Idu usage peaked in the 1860s before declining sharply; (2) the transition from Hanja to Hangul was a rapid transformation starting around 1890; and (3) North Korea's lexical divergence causes modern tokenizers to produce up to 51 times higher out-of-vocabulary rates. This work provides a foundational resource for quantitative diachronic analysis by capturing the history of the Korean language. Moreover, it can serve as a pre-training corpus for large language models, potentially improving their understanding of Sino-Korean vocabulary in modern Hangul as well as archaic writing systems.
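The third finding compares out-of-vocabulary (OOV) rates across text collections. The abstract does not say how OOV is defined or which tokenizers were used, so the following is only a minimal sketch assuming the simplest definition: the fraction of tokens missing from a fixed vocabulary. The function name `oov_rate`, the toy vocabulary, and the placeholder tokens are all hypothetical and are not taken from the paper.

```python
def oov_rate(tokens, vocab):
    """Return the fraction of tokens that are absent from vocab.

    Assumed definition for illustration; the paper's exact metric
    is not given in this summary.
    """
    if not tokens:
        return 0.0
    missing = sum(1 for tok in tokens if tok not in vocab)
    return missing / len(tokens)


# Placeholder data (not real corpus tokens).
vocab = {"w1", "w2", "w3", "w4", "w5", "w6"}
benchmark_tokens = ["w1", "w2", "w3", "w4", "w5", "w6", "x1", "x2"]
variety_tokens = ["w1", "y1", "w2", "y2", "y3", "w3", "y4", "w4"]

benchmark_oov = oov_rate(benchmark_tokens, vocab)  # 2 of 8 tokens missing
variety_oov = oov_rate(variety_tokens, vocab)      # 4 of 8 tokens missing

# The "N times higher" figure is the ratio of the two rates.
ratio = variety_oov / benchmark_oov
print(benchmark_oov, variety_oov, ratio)
```

With a subword tokenizer the same idea would probably be applied to unknown-token counts rather than whole-word lookups, but that choice is also an assumption here.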

Problem

Research questions and friction points this paper is trying to address.

Analyzing the historical evolution of the Korean language over centuries

Addressing the lack of accessible historical corpora for Korean NLP research

Quantifying the script transition from Hanja to Hangul and the lexical divergence of North Korean texts
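One plausible way to quantify a script transition is to measure, per document, the share of Hangul characters among all Hangul and Hanja characters, then track that share over time. The sketch below assumes classification by Unicode block; the helper names `classify_char` and `hangul_share` are hypothetical, and nothing in this summary confirms that the authors measured the transition this way.

```python
from collections import Counter


def classify_char(ch):
    """Label a character as 'hangul', 'hanja', or 'other' by Unicode block."""
    cp = ord(ch)
    # Hangul Syllables, Hangul Jamo, Hangul Compatibility Jamo
    if (0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF
            or 0x3130 <= cp <= 0x318F):
        return "hangul"
    # CJK Unified Ideographs, Extension A, CJK Compatibility Ideographs
    if (0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF
            or 0xF900 <= cp <= 0xFAFF):
        return "hanja"
    return "other"


def hangul_share(text):
    """Return Hangul / (Hangul + Hanja) character share, or None if neither occurs."""
    counts = Counter(classify_char(ch) for ch in text)
    scripted = counts["hangul"] + counts["hanja"]
    if scripted == 0:
        return None
    return counts["hangul"] / scripted


print(hangul_share("國語 교육"))  # two Hanja and two Hangul characters
print(hangul_share("한글"))
print(hangul_share("abc 123"))
```

Averaging this share over documents grouped by decade would give a simple transition curve. The ranges above omit the extended Jamo blocks and the supplementary-plane ideograph extensions, so a real analysis of archaic texts would need a fuller table.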

Innovation

Methods, ideas, or system contributions that make the work stand out.

Created a large-scale diachronic corpus spanning 1,300 years

Collected 18 million documents from 19 diverse sources

Enabled quantitative analysis of linguistic shifts and divergence
