🤖 AI Summary
This work addresses the under-evaluation of large language models (LLMs) on word sense disambiguation (WSD), a fundamental lexical semantic task. We propose the first LLM-native, two-stage benchmark—sense definition generation and candidate sense selection—designed to holistically assess both generative and discriminative capabilities of LLMs. Leveraging XL-WSD and BabelNet, we construct a novel multilingual, multi-sense dataset, enabling the first systematic adaptation of classical WSD to the LLM paradigm. We conduct comprehensive evaluations across zero-shot prompting and lightweight supervised fine-tuning on diverse open- and closed-source LLMs. Results show that while zero-shot LLMs demonstrate robustness, they do not surpass traditional state-of-the-art (SOTA) methods. In contrast, medium-scale LLMs with minimal fine-tuning achieve new SOTA on both subtasks, underscoring fine-tuning’s critical role in unlocking LLMs’ fine-grained semantic understanding. This work establishes a new evaluation paradigm and provides empirical foundations for advancing both WSD research and LLM assessment.
📝 Abstract
Word Sense Disambiguation (WSD) is a long-standing task in computational linguistics that has received much attention over the years. However, with the advent of Large Language Models (LLMs), interest in this task (in its classical definition) has decreased. In this study, we evaluate the performance of various LLMs on the WSD task. We extend a previous benchmark (XL-WSD) to re-design two subtasks suitable for LLMs: 1) given a word in a sentence, the LLM must generate the correct definition; 2) given a word in a sentence and a set of predefined meanings, the LLM must select the correct one. The extended benchmark is built using XL-WSD and BabelNet. The results indicate that LLMs perform well in zero-shot learning but do not surpass current state-of-the-art methods. However, a fine-tuned model with a medium number of parameters outperforms all other models, including the state-of-the-art.
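The two subtasks can be sketched as prompt templates. This is a minimal illustration of the task format only; the function names and prompt wording are hypothetical and the paper's actual prompts may differ.

```python
# Hypothetical sketch of the two WSD subtasks recast as LLM prompts.
# Template wording is illustrative, not the paper's actual prompts.

def definition_generation_prompt(sentence: str, target: str) -> str:
    """Subtask 1: ask the model to generate a definition of the
    target word as it is used in the given sentence."""
    return (
        f'Sentence: "{sentence}"\n'
        f'Generate a definition for the word "{target}" as used in this sentence.'
    )

def sense_selection_prompt(sentence: str, target: str, senses: list[str]) -> str:
    """Subtask 2: ask the model to select the correct sense of the
    target word from a numbered list of candidate definitions."""
    options = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(senses))
    return (
        f'Sentence: "{sentence}"\n'
        f'Which definition matches the word "{target}" here?\n'
        f"{options}\n"
        "Answer with the number of the correct definition."
    )

# Example: the classic "bank" ambiguity.
senses = [
    "a financial institution that accepts deposits",
    "the sloping land beside a body of water",
]
print(sense_selection_prompt("She sat on the bank of the river.", "bank", senses))
```

In the selection subtask, the candidate senses would come from the word's BabelNet synsets, so the number of options varies per word.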