🤖 AI Summary
This work addresses the challenge that conventional subword models struggle to capture orthographic similarities and morphological variations in low-resource and morphologically complex languages. To overcome this limitation, the authors propose Rich Character Embeddings (RCE), a transformer-based method that computes word representations directly from character strings, without tokenization or subword segmentation, enriching them with both semantic and syntactic information. The authors also introduce a hybrid model combining transformer and convolutional mechanisms; both variants yield plug-and-play word embeddings for existing architectures. Experiments on SWAG, declension prediction, and metaphor and chiasmus detection show that RCE outperforms token-based baselines, with particularly strong performance under data-scarce conditions as measured by the OddOneOut and TopK metrics.
📝 Abstract
Tokenization- and sub-tokenization-based models such as word2vec, BERT, and the GPTs are the state of the art in natural language processing. However, these approaches are limited by their input representation: they fail to fully capture orthographic similarities and morphological variations, especially in highly inflected and under-resourced languages. To mitigate this problem, we propose computing word vectors directly from character strings, integrating both semantic and syntactic information. We call this transformer-based approach Rich Character Embeddings (RCE). We further propose a hybrid model that combines transformer and convolutional mechanisms. Both vector representations can be used as drop-in replacements for dictionary- and subtoken-based word embeddings in existing model architectures, with the potential to improve performance for both large context-based language models such as BERT and small models such as word2vec on under-resourced and morphologically rich languages. We evaluate our approach on tasks such as SWAG, declension prediction for inflected languages, and metaphor and chiasmus detection across several languages. Our experiments show that it outperforms traditional token-based approaches on limited data, as measured by the OddOneOut and TopK metrics.