Exploring the Word Sense Disambiguation Capabilities of Large Language Models

📅 2025-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the under-evaluation of large language models (LLMs) on word sense disambiguation (WSD), a fundamental lexical semantic task. We propose the first LLM-native, two-stage benchmark—sense definition generation and candidate sense selection—designed to holistically assess both generative and discriminative capabilities of LLMs. Leveraging XL-WSD and BabelNet, we construct a novel multilingual, multi-sense dataset, enabling the first systematic adaptation of classical WSD to the LLM paradigm. We conduct comprehensive evaluations across zero-shot prompting and lightweight supervised fine-tuning on diverse open- and closed-source LLMs. Results show that while zero-shot LLMs demonstrate robustness, they do not surpass traditional state-of-the-art (SOTA) methods. In contrast, medium-scale LLMs with minimal fine-tuning achieve new SOTA on both subtasks, underscoring fine-tuning’s critical role in unlocking LLMs’ fine-grained semantic understanding. This work establishes a new evaluation paradigm and provides empirical foundations for advancing both WSD research and LLM assessment.

📝 Abstract
Word Sense Disambiguation (WSD) is a long-standing task in computational linguistics that has received much attention over the years. However, with the advent of Large Language Models (LLMs), interest in the task in its classical formulation has decreased. In this study, we evaluate the performance of various LLMs on WSD. We extend a previous benchmark (XL-WSD) to redesign two subtasks suitable for LLMs: 1) given a word in a sentence, the LLM must generate the correct definition; 2) given a word in a sentence and a set of predefined meanings, the LLM must select the correct one. The extended benchmark is built using XL-WSD and BabelNet. The results indicate that LLMs perform well in zero-shot settings but cannot surpass current state-of-the-art methods. However, a fine-tuned model with a medium number of parameters outperforms all other models, including the state-of-the-art.
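The two subtasks described above can be sketched as prompt templates. This is a minimal illustrative sketch only: the function names and prompt wording are assumptions, not the paper's actual prompts, and the candidate glosses would in practice come from BabelNet via the XL-WSD sense inventory.

```python
# Hedged sketch of the two LLM-adapted WSD subtasks.
# All names and prompt phrasings are illustrative assumptions.

def definition_generation_prompt(sentence: str, target: str) -> str:
    """Subtask 1: ask the LLM to generate a definition of the
    target word as it is used in the given sentence."""
    return (
        f"Sentence: {sentence}\n"
        f"Define the word '{target}' as it is used in the sentence above."
    )

def sense_selection_prompt(sentence: str, target: str, senses: list[str]) -> str:
    """Subtask 2: ask the LLM to pick the correct sense from a set
    of predefined candidate meanings (e.g. BabelNet glosses)."""
    options = "\n".join(f"{i + 1}. {gloss}" for i, gloss in enumerate(senses))
    return (
        f"Sentence: {sentence}\n"
        f"Candidate meanings of '{target}':\n{options}\n"
        f"Answer with the number of the correct meaning."
    )

# Example with a classic ambiguous word:
senses = [
    "a financial institution that accepts deposits",
    "sloping land beside a body of water",
]
prompt = sense_selection_prompt("I sat on the bank of the river.", "bank", senses)
```

Subtask 1 exercises the model's generative ability (the output is compared against the gold gloss), while subtask 2 reduces to a discriminative multiple-choice decision over the sense inventory.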
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on Word Sense Disambiguation tasks.
Extending XL-WSD benchmark for LLM-specific subtasks.
Assessing zero-shot learning vs. fine-tuned model performance.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extended XL-WSD benchmark for LLM evaluation
LLMs tested on definition generation and selection
Fine-tuned medium-parameter model outperforms state-of-the-art