Lexical Complexity Prediction and Lexical Simplification for Catalan and Spanish: Resource Creation, Quality Assessment, and Ethical Considerations

📅 2024-04-11

🏛️ Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

This work addresses two challenges in automatic lexical simplification for Spanish and Catalan: data scarcity and inadequate evaluation. We introduce the first bilingual lexical complexity dataset with scalar ratings of understanding difficulty. It fills a critical gap for Catalan, which previously lacked any annotated resources, and provides the first fine-grained, expert-validated scalar difficulty annotations for Spanish. Methodologically, we combine multi-round human annotation, linguistic analysis, and ethical risk assessment into a data curation and evaluation framework that weighs data quality and appropriateness alongside societal impact. Our contributions are threefold: (1) the release of the first open-source bilingual lexical simplification benchmark dataset; (2) a novel scalar difficulty rating scheme coupled with a multidimensional quality validation protocol; and (3) a systematic identification and formal delineation of data biases, fairness concerns, and ethical boundaries for downstream applications. Together, these advance trustworthy lexical simplification research for low-resource languages.

📝 Abstract

Automatic lexical simplification is the task of substituting lexical items that may be unfamiliar and difficult to understand with easier and more common words. This paper describes and analyzes two novel datasets for lexical simplification, one in Spanish and one in Catalan. The Catalan dataset is the first of its kind, and the Spanish dataset is a substantial addition to the sparse data available for automatic lexical simplification in that language. Specifically, it is the first dataset for Spanish that includes scalar ratings of how difficult lexical items are to understand. In addition, we present a detailed analysis assessing the appropriateness and ethical dimensions of the data for the lexical simplification task.

Problem

Research questions and friction points this paper is trying to address.

Develops lexical simplification datasets

Focuses on Spanish and Catalan

Assesses the appropriateness and ethical dimensions of the data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Creates first Catalan lexical simplification dataset

Introduces Spanish scalar difficulty ratings

Analyzes dataset ethics and appropriateness
