Lemma Dilemma: On Lemma Generation Without Domain- or Language-Specific Training Data

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses lemmatization in zero-resource settings, i.e., without domain- or language-specific annotated data, by proposing a large language model (LLM)-based few-shot in-context learning method. Departing from the conventional supervised fine-tuning paradigm, it demonstrates that LLMs achieve accurate, context-sensitive lemmatization across multilingual and morphologically complex texts using only a handful of in-context examples, without any parameter updates. Experiments span 12 languages; on most of them, the approach outperforms encoder-only models fine-tuned out of domain, establishing new state-of-the-art results. The key contributions are threefold: (1) establishing LLMs as effective general-purpose lemmatizers, (2) validating their cross-lingual generalization without target-language annotations, and (3) reducing reliance on high-quality labeled data, thereby advancing NLP for low-resource languages.

📝 Abstract
Lemmatization is the task of transforming all words in a given text to their dictionary forms. While large language models (LLMs) have demonstrated their ability to achieve competitive results across a wide range of NLP tasks, there is no prior evidence of how effective they are in the contextual lemmatization task. In this paper, we empirically investigate the capacity of the latest generation of LLMs to perform in-context lemmatization, comparing it to the traditional fully supervised approach. In particular, we consider the setting in which supervised training data is not available for a target domain or language, comparing (i) encoder-only supervised approaches, fine-tuned out-of-domain, and (ii) cross-lingual methods, against direct in-context lemma generation with LLMs. Our experimental investigation across 12 languages of different morphological complexity finds that, while encoders remain competitive in out-of-domain settings when fine-tuned on gold data, current LLMs reach state-of-the-art results for most languages by directly generating lemmas in-context without prior fine-tuning, provided just with a few examples. Data and code available upon publication: https://github.com/oltoporkov/lemma-dilemma
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' effectiveness in contextual lemmatization without training data
Comparing in-context lemma generation against supervised cross-lingual methods
Investigating lemmatization performance across 12 morphologically diverse languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using LLMs for in-context lemmatization without fine-tuning
Direct lemma generation with few examples per language
Outperforming supervised methods in cross-lingual settings
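The in-context approach above can be sketched as a few-shot prompt that shows the model a handful of (sentence, word, lemma) triples and then asks for the lemma of a new word in context. The prompt template and example pairs below are illustrative assumptions, not the paper's exact format:

```python
# Sketch of few-shot in-context lemmatization prompting.
# The template and examples are illustrative, not the paper's exact prompt.

FEW_SHOT_EXAMPLES = [
    # (sentence, target word, gold lemma)
    ("She was running late.", "running", "run"),
    ("The mice hid quickly.", "mice", "mouse"),
]

def build_prompt(sentence: str, word: str) -> str:
    """Assemble a few-shot prompt asking an LLM for the lemma of
    `word` as it appears in `sentence` (context-sensitive)."""
    lines = ["Give the dictionary form (lemma) of the marked word in context.", ""]
    for ex_sent, ex_word, ex_lemma in FEW_SHOT_EXAMPLES:
        lines.append(f"Sentence: {ex_sent}")
        lines.append(f"Word: {ex_word}")
        lines.append(f"Lemma: {ex_lemma}")
        lines.append("")
    # The query instance ends with an open "Lemma:" slot for the model to fill.
    lines.append(f"Sentence: {sentence}")
    lines.append(f"Word: {word}")
    lines.append("Lemma:")
    return "\n".join(lines)

prompt = build_prompt("Los niños corrieron al parque.", "corrieron")
print(prompt)
```

The completed prompt would then be sent to an LLM, whose single-token-or-short completion after the final "Lemma:" is taken as the predicted lemma; no fine-tuning or parameter updates are involved.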