🤖 AI Summary
This study addresses definition generation for under-resourced languages, using Belarusian as a case study to alleviate the lack of automated definition support in lexicography. We introduce the first Belarusian definition dataset, comprising 43,150 entries, and propose a few-shot transfer framework based on pretrained language models, integrating context-aware decoding with lightweight adaptation. Experiments demonstrate that high-quality definitions can be generated using only ~1,000 annotated examples, substantially outperforming zero-shot baselines. Moreover, we reveal systematic discrepancies between mainstream automatic metrics (e.g., BLEU, BERTScore) and human evaluation. Our contributions are threefold: (1) establishing the first benchmark dataset for Belarusian definition generation; (2) empirically validating the feasibility of few-shot definition modeling for low-resource languages; and (3) highlighting critical limitations of automatic evaluation, thereby providing both methodological insights and empirical foundations for cross-lingual lexicographic research.
📝 Abstract
Definition modeling, the task of generating new definitions for words in context, holds great promise as a means to assist the work of lexicographers in documenting a broader variety of lects and languages, yet much remains to be done to assess how we can leverage pre-existing models for as-of-yet unsupported languages. In this work, we focus on adapting existing models to Belarusian, for which we propose a novel dataset of 43,150 definitions. Our experiments demonstrate that adapting a definition modeling system requires minimal amounts of data, but also that there are currently gaps in what automatic metrics capture.