CoAM: Corpus of All-Type Multiword Expressions

📅 2024-12-24
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing MWE identification datasets are inconsistently annotated, limited to a single MWE type, or small in scale, undermining reliable and comprehensive task evaluation. To address this, the authors introduce CoAM, a 1.3K-sentence corpus covering all MWE types, built through a multi-step quality-enhancement pipeline of human annotation, human review, and automated consistency checking. CoAM is the first MWE identification dataset whose MWEs carry type tags, such as Noun and Verb, enabling fine-grained error analysis. Annotations were collected with a new interface produced by the authors' interface generator, which supports easy and flexible annotation of MWEs in any form. In experiments on CoAM, a fine-tuned large language model outperforms MWEasWSD, the previous state of the art on the DiMSUM dataset. Analysis using the type-tagged data further shows that Verb MWEs are easier to identify than Noun MWEs across approaches.

📝 Abstract
Multiword expressions (MWEs) refer to idiomatic sequences of multiple words. MWE identification, i.e., detecting MWEs in text, can play a key role in downstream tasks such as machine translation, but existing datasets for the task are inconsistently annotated, limited to a single type of MWE, or limited in size. To enable reliable and comprehensive evaluation, we created CoAM: Corpus of All-Type Multiword Expressions, a dataset of 1.3K sentences constructed through a multi-step quality-enhancement process consisting of human annotation, human review, and automated consistency checking. Additionally, for the first time in an MWE identification dataset, CoAM's MWEs are tagged with MWE types, such as Noun and Verb, enabling fine-grained error analysis. Annotations for CoAM were collected using a new interface created with our interface generator, which allows easy and flexible annotation of MWEs in any form. Through experiments using CoAM, we find that a fine-tuned large language model outperforms MWEasWSD, which achieved state-of-the-art performance on the DiMSUM dataset. Furthermore, analysis using our MWE type tagged data reveals that Verb MWEs are easier to identify than Noun MWEs across approaches.
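MWE identification, as described above, amounts to predicting sets of (possibly discontinuous) token spans per sentence. The following sketch shows how such predictions are commonly scored with exact-match span F1; the data layout and function name are illustrative assumptions, not CoAM's actual evaluation tooling.

```python
# Illustrative sketch: span-level F1 for MWE identification.
# Each MWE is a tuple of token indices (supporting discontinuous MWEs);
# each sentence contributes a set of such tuples. This layout is an
# assumption for illustration, not CoAM's actual format.

def mwe_f1(gold: list[set[tuple[int, ...]]],
           pred: list[set[tuple[int, ...]]]) -> float:
    """Exact-match F1 over MWE spans, micro-averaged across sentences."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_gold = sum(len(g) for g in gold)
    n_pred = sum(len(p) for p in pred)
    if n_gold == 0 or n_pred == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall) if tp else 0.0

# "She kicked the bucket": gold MWE is tokens 1 and 3 ("kicked ... bucket"),
# a discontinuous span skipping "the".
gold = [{(1, 3)}]
pred = [{(1, 3)}]
print(mwe_f1(gold, pred))  # 1.0
```

Exact matching is strict: a prediction that covers only part of a gold MWE counts as both a false positive and a false negative, which is one reason discontinuous and nested MWEs make the task hard to evaluate.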
Problem

Research questions and friction points this paper is trying to address.

Inconsistent annotation in existing MWE datasets
Limited coverage of MWE types in current datasets
Lack of fine-grained MWE type tagging for analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-step human and automated quality enhancement
First MWE dataset with type tagging
Custom interface for flexible annotation
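One of the quality-enhancement steps above is automated consistency checking. The paper does not spell out its implementation, so the following is only a hypothetical sketch of one such check: flagging sentences where a word sequence annotated as an MWE elsewhere in the corpus appears unannotated. The corpus layout and the rough substring match are assumptions for illustration.

```python
# Hypothetical sketch of an automated consistency check for MWE annotation:
# flag sentences containing a surface form that is annotated as an MWE in
# some other sentence but not here. CoAM's actual checks may differ.
from collections import defaultdict

def find_inconsistencies(corpus):
    """corpus: list of (tokens, mwe_spans), where mwe_spans is a set of
    tuples of token indices. Returns {surface form: sentence ids where the
    form occurs but is not annotated}."""
    annotated = defaultdict(set)
    # Pass 1: collect every surface form annotated as an MWE anywhere.
    for sid, (tokens, spans) in enumerate(corpus):
        for span in spans:
            form = " ".join(tokens[i] for i in span)
            annotated[form].add(sid)
    # Pass 2: flag sentences that contain an annotated form without marking
    # it (a rough substring check; a real tool would match token spans).
    flagged = defaultdict(set)
    for sid, (tokens, _) in enumerate(corpus):
        text = " ".join(tokens)
        for form, sids in annotated.items():
            if form in text and sid not in sids:
                flagged[form].add(sid)
    return dict(flagged)

corpus = [
    (["He", "kicked", "the", "bucket"], {(1, 2, 3)}),
    (["She", "kicked", "the", "bucket", "too"], set()),  # same phrase, unmarked
]
print(find_inconsistencies(corpus))  # {'kicked the bucket': {1}}
```

Flagged sentences would then go back to human reviewers rather than being corrected automatically, matching the human-in-the-loop pipeline the summary describes.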