Lexical Complexity Prediction and Lexical Simplification for Catalan and Spanish: Resource Creation, Quality Assessment, and Ethical Considerations

📅 2024-04-11
🏛️ Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work addresses two challenges in automatic lexical simplification for Spanish and Catalan: data scarcity and inadequate evaluation. We introduce the first bilingual lexical complexity dataset featuring scalar ratings of understanding difficulty. It fills a critical gap for Catalan, which previously lacked any annotated resources, and provides the first fine-grained, expert-validated scalar difficulty annotations for Spanish. Methodologically, we integrate multi-round human annotation, linguistic analysis, and ethical risk assessment into a data curation and evaluation framework that weighs data quality and appropriateness against societal impact. Our contributions are threefold: (1) the release of the first open-source bilingual lexical simplification benchmark dataset; (2) a novel scalar difficulty rating scheme coupled with a multidimensional quality validation protocol; and (3) a systematic identification and formal delineation of data biases, fairness concerns, and ethical boundaries for downstream applications. Together, these advance trustworthy lexical simplification research for low-resource languages.

📝 Abstract
Automatic lexical simplification is the task of substituting lexical items that may be unfamiliar and difficult to understand with easier and more common words. This paper presents the description and analysis of two novel datasets for lexical simplification in Spanish and Catalan. The Catalan dataset is the first of its kind, and the Spanish dataset is a substantial addition to the sparse data available for automatic lexical simplification in Spanish. Specifically, it is the first dataset for Spanish that includes scalar ratings of the understanding difficulty of lexical items. In addition, we present a detailed analysis that assesses the appropriateness and ethical dimensions of the data for the lexical simplification task.
Problem

Research questions and friction points this paper is trying to address.

Develops lexical simplification datasets
Focuses on Spanish and Catalan
Assesses appropriateness and ethical dimensions of the data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Creates first Catalan lexical simplification dataset
Introduces Spanish scalar difficulty ratings
Analyzes dataset ethics and appropriateness
Stefan Bott
Universitat Pompeu Fabra, Barcelona, Spain
Horacio Saggion
Chair in Computer Science & Artificial Intelligence, Universitat Pompeu Fabra, DTIC. Head of TALN.
Natural Language Processing, Artificial Intelligence, Computer Science
Nelson Pérez Rojas
Instituto Tecnológico de Costa Rica, Cartago, Costa Rica
Martin Solis Salazar
Instituto Tecnológico de Costa Rica, Cartago, Costa Rica
Saúl Calderón Ramírez
Instituto Tecnológico de Costa Rica, Cartago, Costa Rica