🤖 AI Summary
This study critically examines the widely held assumption that Classical Chinese resources inherently benefit cross-lingual transfer for historical Sinitic texts from East Asia (e.g., Korean Hanja and Japanese Kanbun documents). Method: We propose a multi-task evaluation framework, covering machine translation, named entity recognition (NER), and punctuation restoration, to empirically assess transfer efficacy across model scales and domain-specific datasets. Contribution/Results: Our findings reveal negligible gains from adding Classical Chinese data for Hanja: performance differences stay within ±0.0068 F1 for sequence labeling (NER and punctuation restoration) and reach at most +0.84 BLEU for translation. Substantial improvements appear only under extreme low-resource conditions (≤1k annotated local sentences), and the benefits vanish rapidly as local data volume increases. This work provides empirical evidence that the cross-lingual transfer value of Classical Chinese for Sinitic text processing is highly conditional, which challenges prevailing resource selection practices, and it offers methodological insights and data strategy guidance for historical East Asian NLP.
📝 Abstract
Historical documents in the Sinosphere are known to share common formats and practices, particularly in veritable records compiled by court historians. This shared linguistic heritage has led researchers to use Classical Chinese resources for cross-lingual transfer when processing historical documents from Korea and Japan, which remain relatively low-resource. In this paper, we question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun, the ancient written languages of Korea and Japan, respectively. Our experiments across machine translation, named entity recognition, and punctuation restoration tasks show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja, with performance differences within $\pm 0.0068$ F1-score for sequence labeling tasks and up to $+0.84$ BLEU score for translation. These limitations persist consistently across various model sizes, architectures, and domain-specific datasets. Our analysis reveals that the benefits of Classical Chinese resources diminish rapidly as local language data increases for Hanja, while showing substantial improvements only in extremely low-resource scenarios for both Korean and Japanese historical documents. These findings emphasize the need for careful empirical validation rather than assuming benefits from indiscriminate cross-lingual transfer.