MoVoC: Morphology-Aware Subword Construction for Geez Script Languages

📅 2025-09-10

🤖 AI Summary

Subword segmentation for low-resource, morphologically complex languages, such as those written in the Geez script, often fails to preserve morphological boundaries, which undermines linguistic fidelity and token efficiency. Method: This paper proposes MoVoC (Morpheme-aware Subword Vocabulary Construction), a hybrid tokenization method that integrates supervised morphological analysis with Byte-Pair Encoding (BPE). It is the first to incorporate supervised morphological analysis into subword vocabulary construction for Geez script languages. Contribution/Results: The authors release manually annotated morpheme datasets for four Geez script languages and morpheme-aware vocabularies for two of them. The resulting MoVoC-Tok tokenizer combines morpheme-based and BPE tokens to preserve morphological integrity while maintaining lexical meaning. Intrinsic evaluation shows consistent improvements over baselines in MorphoScore (+12.4) and Boundary Precision (+9.7%). Machine translation quality does not improve significantly, but the intrinsic results support the value of morphology-aware tokenization for morphologically rich, low-resource languages.

📝 Abstract

Subword-based tokenization methods often fail to preserve morphological boundaries, a limitation especially pronounced in low-resource, morphologically complex languages such as those written in the Geez script. To address this, we present MoVoC (Morpheme-aware Subword Vocabulary Construction) and train MoVoC-Tok, a tokenizer that integrates supervised morphological analysis into the subword vocabulary. This hybrid segmentation approach combines morpheme-based and Byte Pair Encoding (BPE) tokens to preserve morphological integrity while maintaining lexical meaning. To tackle resource scarcity, we curate and release manually annotated morpheme data for four Geez script languages and a morpheme-aware vocabulary for two of them. While the proposed tokenization method does not lead to significant gains in automatic translation quality, we observe consistent improvements in two intrinsic metrics, MorphoScore and Boundary Precision, highlighting the value of morphology-aware segmentation in enhancing linguistic fidelity and token efficiency. To support further research in low-resource, morphologically rich languages, we make our morpheme-annotated datasets and tokenizer publicly available; our code and data are on GitHub: https://github.com/hailaykidu/MoVoC
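The abstract names Boundary Precision as an intrinsic metric but does not define it here. As a rough illustration only, the sketch below assumes one common formulation: the share of predicted token boundaries that coincide with gold morpheme boundaries. The function names and the toy English word are hypothetical and are not taken from the paper or its repository.

```python
def boundaries(segments):
    """Return the set of internal split offsets for a segmented word."""
    offsets, pos = set(), 0
    for seg in segments[:-1]:  # the end of the word is not an internal boundary
        pos += len(seg)
        offsets.add(pos)
    return offsets


def boundary_precision(predicted, gold):
    """Fraction of predicted boundaries that are also gold morpheme boundaries.

    Assumed definition for illustration; the paper's exact metric may differ.
    """
    pred, ref = boundaries(predicted), boundaries(gold)
    if not pred:
        # Convention chosen for this sketch: no predicted boundary, none wrong.
        return 1.0
    return len(pred & ref) / len(pred)


# Toy example: one of the two predicted splits matches a gold boundary.
gold = ["un", "break", "able"]
predicted = ["un", "brea", "kable"]
score = boundary_precision(predicted, gold)
```

Under this assumed definition, a tokenizer that splits less often can score higher on precision, which is why such a metric is usually read alongside a recall-style or overall morphological score.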

Problem

Research questions and friction points this paper is trying to address.

Preserving morphological boundaries in subword tokenization

Addressing resource scarcity for Geez script languages

Integrating morphological analysis into vocabulary construction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Morphology-aware subword vocabulary construction

Hybrid morpheme-based and BPE tokenization

Creation of manually annotated morpheme datasets
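The page does not describe how MoVoC combines morpheme tokens with BPE tokens. One plausible reading is sketched below: segment a word with a supervised morphological analysis, keep any morpheme that is already in the morpheme vocabulary as a single token, and fall back to BPE merges for the rest. Every name here (`apply_bpe`, `hybrid_tokenize`, `morph_segments`, `morpheme_vocab`, `merges`) and the toy English data are assumptions for illustration, not the authors' implementation.

```python
def apply_bpe(piece, merges):
    """Apply ranked BPE merges to one piece (simplified: one merge per step)."""
    symbols = list(piece)
    ranks = {pair: i for i, pair in enumerate(merges)}
    while len(symbols) > 1:
        # Rank every adjacent pair; unknown pairs get infinite rank.
        pairs = [(ranks.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        best_rank, i = min(pairs)
        if best_rank == float("inf"):
            break  # no learned merge applies any more
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


def hybrid_tokenize(word, morph_segments, morpheme_vocab, merges):
    """Keep in-vocabulary morphemes whole; segment everything else with BPE."""
    tokens = []
    # Words without a morphological analysis are treated as a single piece.
    for morpheme in morph_segments.get(word, [word]):
        if morpheme in morpheme_vocab:
            tokens.append(morpheme)
        else:
            tokens.extend(apply_bpe(morpheme, merges))
    return tokens


# Hypothetical toy data standing in for annotated morphemes and learned merges.
morph_segments = {"unbreakable": ["un", "break", "able"]}
morpheme_vocab = {"un", "able"}
merges = [("b", "r"), ("e", "a"), ("br", "ea")]

tokens = hybrid_tokenize("unbreakable", morph_segments, morpheme_vocab, merges)
```

In this sketch the morpheme boundaries from the analysis are never crossed by a BPE merge, because BPE runs inside each morpheme separately; that is the property a morphology-aware tokenizer is meant to protect.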

🔎 Similar Papers

Unsupervised Morphological Tree Tokenizer