Evaluating the Evaluators: Are readability metrics good measures of readability?

📅 2025-08-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study challenges the validity of conventional readability metrics (e.g., Flesch-Kincaid) for Plain Language Summarization (PLS), as their alignment with human judgments had not been empirically verified. The authors systematically evaluate the correlation between eight traditional metrics and human readability annotations across multiple PLS benchmarks, finding consistently weak agreement (Pearson r < 0.3). Crucially, they provide empirical evidence that language model-based judges, particularly those capturing deeper dimensions such as required background knowledge, substantially outperform traditional measures, with the best model reaching a Pearson correlation of 0.56 with human judgments. The findings motivate a shift in PLS evaluation from shallow, surface-level statistical heuristics toward semantics-aware, model-based assessment, and the paper closes with best-practice recommendations for evaluating plain language summaries.

📝 Abstract
Plain Language Summarization (PLS) aims to distill complex documents into accessible summaries for non-expert audiences. In this paper, we conduct a thorough survey of PLS literature, and identify that the current standard practice for readability evaluation is to use traditional readability metrics, such as Flesch-Kincaid Grade Level (FKGL). However, despite proven utility in other fields, these metrics have not been compared to human readability judgments in PLS. We evaluate 8 readability metrics and show that most correlate poorly with human judgments, including the most popular metric, FKGL. We then show that Language Models (LMs) are better judges of readability, with the best-performing model achieving a Pearson correlation of 0.56 with human judgments. Extending our analysis to PLS datasets, which contain summaries aimed at non-expert audiences, we find that LMs better capture deeper measures of readability, such as required background knowledge, and lead to different conclusions than the traditional metrics. Based on these findings, we offer recommendations for best practices in the evaluation of plain language summaries. We release our analysis code and survey data.
Problem

Research questions and friction points this paper is trying to address.

Evaluating whether readability metrics match human judgments
Comparing traditional metrics with language model performance
Assessing how readability should be evaluated in plain language summarization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language Models proposed as replacements for traditional readability metrics
LMs correlate better with human readability judgments
LMs capture deeper aspects of readability, such as required background knowledge
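The LM-as-judge approach summarized above can be sketched as prompting a model for a readability rating and parsing its reply. The prompt wording, scale, and stub model below are hypothetical, not the paper's actual setup; the LM call is abstracted as a callable so any chat API can be plugged in.

```python
import re
from typing import Callable

# Hypothetical prompt template (the paper's exact prompt is not given here).
PROMPT = (
    "Rate the readability of the following summary for a non-expert "
    "reader on a scale of 1 (very hard) to 5 (very easy). Consider "
    "vocabulary, sentence structure, and required background knowledge. "
    "Reply with only the number.\n\nSummary: {summary}"
)

def lm_readability_score(summary: str, llm: Callable[[str], str]) -> int:
    """Ask an LM for a 1-5 readability rating and parse the first digit."""
    reply = llm(PROMPT.format(summary=summary))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"Unparseable LM reply: {reply!r}")
    return int(match.group())

# Stub LM for demonstration only; replace with a real chat-model call.
def fake_llm(prompt: str) -> str:
    return "4"

score = lm_readability_score("The heart pumps blood.", fake_llm)
print(score)  # 4
```

Scores collected this way can then be correlated with human annotations in the same manner as the traditional metrics, which is how the paper arrives at its 0.56 Pearson figure for the best-performing model.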