🤖 AI Summary
This work addresses two challenges in automatic lexical simplification for Spanish and Catalan: data scarcity and inadequate evaluation. We introduce the first bilingual lexical complexity dataset featuring scalar ratings of understanding difficulty. It fills a critical gap for Catalan, which previously lacked any annotated resources, and provides the first fine-grained, expert-validated scalar difficulty annotations for Spanish. Methodologically, we integrate multi-round human annotation, linguistic analysis, and ethical risk assessment to establish a robust data curation and evaluation framework that balances data quality and appropriateness with societal impact. Our contributions are threefold: (1) the release of the first open-source bilingual lexical simplification benchmark dataset; (2) a novel scalar difficulty rating scheme coupled with a multidimensional quality validation protocol; and (3) a systematic identification and formal delineation of data biases, fairness concerns, and ethical boundaries for downstream applications, thereby advancing trustworthy lexical simplification research for low-resource languages.
📝 Abstract
Automatic lexical simplification is the task of substituting lexical items that may be unfamiliar and difficult to understand with easier and more common words. This paper presents the description and analysis of two novel datasets for lexical simplification in Spanish and Catalan. Together, they represent the first resource of this kind for Catalan and a substantial addition to the sparse data available for automatic lexical simplification in Spanish. Specifically, the Spanish dataset is the first for that language to include scalar ratings of how difficult lexical items are to understand. In addition, we present a detailed analysis that assesses the appropriateness and ethical dimensions of the data for the lexical simplification task.