A linguistically-motivated evaluation methodology for unraveling model’s abilities in reading comprehension tasks

📅 2025-01-29

🏛️ Conference on Empirical Methods in Natural Language Processing

📈 Citations: 0

✨ Influential: 0

career value

148K/year

🤖 AI Summary

Large language models (LLMs) exhibit systematic failures in reading comprehension due to linguistic complexity, yet existing evaluation paradigms lack interpretability and fail to pinpoint underlying linguistic deficits. Method: We propose a semantics-based, attributable evaluation framework inspired by FrameNet, introducing seven fine-grained, quantifiable linguistic complexity factors—including argument structure and tense embedding—as predictive variables for model failure. To ensure cross-lingual applicability, we employ expert French annotation and ChatGPT-assisted English annotation. Contribution/Results: Statistical analysis reveals that two complexity factors significantly predict failure on the French benchmark (p < 0.01). English experiments demonstrate that state-of-the-art models remain brittle on specific syntactic–semantic constructions, indicating that scaling parameters alone cannot substitute for deep linguistic competence. Our approach breaks from black-box evaluation, offering an interpretable, generalizable framework for diagnosing fine-grained language capabilities.

Technology Category

Application Category

📝 Abstract

We introduce an evaluation methodology for reading comprehension tasks based on the intuition that certain examples, by the virtue of their linguistic complexity, consistently yield lower scores regardless of model size or architecture. We capitalize on semantic frame annotation for characterizing this complexity, and study seven complexity factors that may account for model’s difficulty. We first deploy this methodology on a carefully annotated French reading comprehension benchmark showing that two of those complexity factors are indeed good predictors of models’ failure, while others are less so. We further deploy our methodology on a well studied English benchmark by using chatGPT as a proxy for semantic annotation.Our study reveals that fine-grained linguistically-motivated automatic evaluation of a reading comprehension task is not only possible, but helps understand models’ abilities to handle specific linguistic characteristics of input examples. It also shows that current state-of-the-art models fail with some for those characteristics which suggests that adequately handling them requires more than merely increasing model size.

Problem

Research questions and friction points this paper is trying to address.

Reading Comprehension

Performance Limitations

Language Complexity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Language Model Evaluation

Chat-GPT Automated Analysis

Reading Comprehension Challenges

🔎 Similar Papers

No similar papers found.