Reading Between the Lines: A dataset and a study on why some texts are tougher than others

📅 2025-01-03
🤖 AI Summary
This study addresses the challenge of identifying and modeling what makes text hard to read for people with cognitive impairments, such as reading comprehension deficits or difficulties with conceptual understanding. Method: We propose the first multidimensional, sentence-level text difficulty annotation framework tailored to cognitively impaired users, grounded in psychological and translation-theoretic principles. Based on this framework, we construct the first standardized parallel corpus of English and Easy-to-Read English with fine-grained difficulty annotations. Using this dataset, we fine-tune four Transformer-based models, including BERT, to classify text simplification strategies. Attention visualization and attribution analysis are used to interpret model decisions. Contribution/Results: The best-performing model achieves 89.2% accuracy on simplification strategy classification. This work provides the first systematic empirical validation of pretrained language models’ capacity to assess the cognitive accessibility of text, both effectively and transparently, thereby establishing theoretical foundations and technical infrastructure for accessible information processing.

📝 Abstract
Our research aims to better understand what makes a text difficult to read for specific audiences with intellectual disabilities, more specifically people who have limitations in cognitive functioning, such as reading and understanding skills, an IQ below 70, and challenges in conceptual domains. We introduce a scheme for the annotation of difficulties which is based on empirical research in psychology as well as on research in translation studies. The paper describes the annotated dataset, primarily derived from parallel texts (standard English and Easy to Read English translations) made available online. We fine-tuned four different pre-trained transformer models to perform multiclass classification, predicting the strategies required for simplification. We also investigate whether the decisions of such a language model can be interpreted when it is used to predict the difficulty of sentences. The resources are available from https://github.com/Nouran-Khallaf/why-tough
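
The classification task described in the abstract can be framed roughly as follows. This is a minimal sketch, not the authors' code: it covers only the input/output framing and an accuracy check, not the transformer fine-tuning itself. The strategy names, the AnnotatedSentence record, and the example sentences are all hypothetical, since the abstract does not list the annotation categories or show dataset rows.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical strategy labels: the abstract does not name the annotation categories.
STRATEGIES = ["lexical_substitution", "sentence_splitting", "explanation", "omission"]
LABEL_TO_ID = {name: i for i, name in enumerate(STRATEGIES)}


@dataclass(frozen=True)
class AnnotatedSentence:
    source: str     # sentence from the standard English text
    easy_read: str  # aligned Easy to Read English sentence
    strategy: str   # annotated simplification strategy (one of STRATEGIES)


def encode_labels(rows):
    """Map each row's strategy name to the integer id a classifier would predict."""
    return [LABEL_TO_ID[row.strategy] for row in rows]


def majority_baseline(train_labels):
    """Return the most frequent label id, a floor any fine-tuned model should beat."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(gold, predicted):
    """Fraction of sentences whose predicted strategy matches the annotation."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Invented rows for illustration only; they are not taken from the dataset.
rows = [
    AnnotatedSentence("Applicants must submit documentation.",
                      "You must send us your papers.", "lexical_substitution"),
    AnnotatedSentence("The form, which is long, is required.",
                      "You need the form. The form is long.", "sentence_splitting"),
    AnnotatedSentence("Remuneration is paid monthly.",
                      "You get your pay every month.", "lexical_substitution"),
]

gold = encode_labels(rows)
baseline = majority_baseline(gold)
predictions = [baseline] * len(gold)
print(f"majority-baseline accuracy: {accuracy(gold, predictions):.3f}")
```

A fine-tuned model would replace the constant predictions with one predicted label id per source sentence; the majority baseline is included only as a reference point for reading an accuracy figure.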

Problem

Research questions and friction points this paper is trying to address.

Learning Difficulties
Reading Comprehension
Cognitive Accessibility

Innovation

Methods, ideas, or system contributions that make the work stand out.

Psycholinguistics
Text Simplification
Machine Learning Models