🤖 AI Summary
Classical Korean historical texts written in Literary Sinitic (Hanja) pose significant comprehension barriers for modern readers and impede scholarly translation and research. Method: This project develops an open-source, end-to-end processing platform for Korean Hanja texts, presented as the first of its kind. Its NLP pipeline uses language models customized for Hanja to support punctuation restoration (recovering sentence segmentation), named entity recognition, and neural machine translation from Literary Sinitic into Korean and English. It also offers a human-in-the-loop revision workflow and a web-based, character-level glossary giving modern Korean pronunciations and English definitions. Contribution/Results: The platform is intended to make these texts more accessible: non-specialists can grasp the general content through the translations and annotations, while domain experts can revise the model outputs rather than start from scratch, which should improve translation and proofreading efficiency. It provides an open-source technical foundation for large-scale translation of Korean Hanja corpora into modern languages.
📝 Abstract
Korean historical documents are invaluable cultural heritage, but understanding them requires in-depth Hanja expertise. Hanja is an ancient language used in Korea before the 20th century, whose characters were borrowed from old Chinese but evolved in Korea over centuries. Modern Korean and Chinese readers cannot understand Korean historical documents without substantial additional help. Previous efforts have produced some Korean and English translations, but translation requires in-depth expertise, so most of the documents have not been translated into any modern language. To address this gap, we present HERITAGE, the first open-source Hanja NLP toolkit to assist in understanding and translating unexplored Korean historical documents written in Hanja. HERITAGE is a web-based platform that uses Hanja language models to provide predictions for three critical tasks in historical document understanding: punctuation restoration, named entity recognition, and machine translation (MT). HERITAGE also includes an interactive glossary, which gives the character-level reading of each Hanja character in modern Korean as well as a character-level English definition. HERITAGE serves two purposes. First, anyone interested in these documents can get a general understanding from the model predictions and the interactive glossary, especially from the MT outputs in Korean and English. Second, since the model outputs are not perfect, Hanja experts can revise them to produce better annotations and translations. This would boost translation efficiency and could lead to most of the historical documents being translated into modern languages, lowering the barrier to unexplored Korean historical documents.