🤖 AI Summary
Khmer, a low-resource language whose script is written without spaces between words and whose morphology is highly irregular, has been chronically underrepresented in multilingual models. Method: We propose a lightweight end-to-end sequence-to-sequence model, built from scratch on the BART architecture and, to our knowledge, the first pre-trained specifically for Khmer. It leverages a high-quality curated Khmer and English corpus and incorporates language-aware components: rule-based word segmentation, explicit whitespace modeling, and text normalization. Contribution/Results: Our model achieves monolingual Khmer generation without relying on back-translation or data augmentation, and outperforms mBART50 across machine translation, text summarization, and headline generation. Whitespace generation accuracy improves by 23.6%, and the compact 170M-parameter design enables efficient edge deployment.
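The "explicit whitespace modeling" mentioned above can be illustrated with a minimal sketch. In Khmer, spaces are not word separators but meaningful phrase/clause boundary markers, so one plausible approach is to encode each space as a dedicated token the model must learn to generate. The token name and the encoding scheme below are hypothetical illustrations, not the paper's actual implementation:

```python
# Minimal sketch of explicit whitespace modeling (hypothetical token and
# scheme, not PrahokBART's actual implementation). Khmer spaces mark
# phrase/clause boundaries, so instead of discarding them during
# tokenization, each space is mapped to an explicit token.

WS = "<ws>"  # hypothetical explicit-whitespace token


def encode_whitespace(text: str) -> list[str]:
    """Split text on spaces, inserting an explicit token for each space."""
    tokens: list[str] = []
    for i, chunk in enumerate(text.split(" ")):
        if i > 0:
            tokens.append(WS)  # one token per original space
        if chunk:
            tokens.append(chunk)
    return tokens


def decode_whitespace(tokens: list[str]) -> str:
    """Invert the encoding: explicit tokens become spaces, chunks concatenate."""
    return "".join(" " if t == WS else t for t in tokens)
```

Because the encoding is lossless, whitespace generation accuracy can then be measured directly by comparing the explicit tokens the model emits against those in the reference text.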
📝 Abstract
This work introduces *PrahokBART*, a compact pre-trained sequence-to-sequence model trained from scratch for Khmer on carefully curated Khmer and English corpora. We focus on improving pre-training corpus quality and on addressing linguistic issues of Khmer that existing multilingual models ignore, by incorporating linguistic components such as word segmentation and text normalization. We evaluate PrahokBART on three generative tasks: machine translation, text summarization, and headline generation; the results demonstrate that it outperforms mBART50, a strong multilingual pre-trained model. Our analysis further examines the impact of each linguistic module and how effectively the model handles whitespace during generation, which is crucial for the naturalness of Khmer text.