🤖 AI Summary
This study addresses the poor readability of large language model (LLM) responses in public health question answering for non-expert users. To this end, we introduce RephQA, the first domain-specific readability benchmark for public health. Methodologically, we propose a dual-dimensional readability evaluation framework that integrates the Flesch-Kincaid Grade Level with expert-rated professional scores, coupled with an information-preserving proxy multiple-choice task, to systematically expose the misalignment between reasoning capability and expressive clarity. We further develop token-adapted Group Relative Policy Optimization (GRPO) and associated prompting and fine-tuning strategies to enhance readability. Extensive experiments across 25 mainstream LLMs show that most fail to meet the baseline readability standard of writing at or below the U.S. 8th-grade reading level. In contrast, token-adapted GRPO significantly improves readability, yielding more concise, accurate, and user-friendly responses and helping bridge the comprehension gap in public health knowledge dissemination.
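The Flesch-Kincaid Grade Level used above is a standard formula: `0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59`, where the result approximates the U.S. school grade needed to understand the text. A minimal sketch of the metric follows; the syllable counter is a crude vowel-group heuristic for illustration, not the benchmark's actual implementation:

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count contiguous vowel groups, at least 1 per word."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """FKGL = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / max(1, len(sentences))
        + 11.8 * syllables / max(1, len(words))
        - 15.59
    )
```

Short, common words in short sentences score at an early grade level, while polysyllabic jargon pushes the score far above the 8th-grade target.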
📝 Abstract
Large Language Models (LLMs) hold promise for addressing complex medical problems. However, while most prior studies focus on improving accuracy and reasoning, a significant bottleneck in developing effective healthcare agents lies in the readability of LLM-generated responses: their ability to answer public health questions clearly and simply for people without medical backgrounds. In this work, we introduce RephQA, a benchmark for evaluating the readability of LLMs in public health question answering (QA). It contains 533 expert-reviewed QA pairs from 27 sources across 13 topics, and includes a proxy multiple-choice task to assess informativeness, along with two readability metrics: Flesch-Kincaid Grade Level and a professional score. Evaluation of 25 LLMs reveals that most fail to meet readability standards, highlighting a gap between reasoning and effective communication. To address this, we explore four readability-enhancing strategies: standard prompting, chain-of-thought prompting, Group Relative Policy Optimization (GRPO), and a token-adapted GRPO variant. Token-adapted GRPO achieves the best results, representing a step toward more practical, user-friendly agents for public health.
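The core of GRPO is to replace a learned value baseline with group-relative normalization: several responses are sampled per question, and each response's advantage is its reward standardized against the group. The sketch below shows that core step plus a hypothetical readability-shaped reward that penalizes text above a target grade level; the paper's actual token-adapted reward and penalty weights are not specified here and may differ:

```python
import statistics


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO core: standardize each sampled response's reward against its
    own group (mean/std), with no learned value function."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mu) / sigma for r in rewards]


def readability_reward(
    correct: bool,
    grade_level: float,
    target_grade: float = 8.0,  # U.S. 8th-grade readability target
    penalty: float = 0.1,       # hypothetical penalty weight
) -> float:
    """Hypothetical reward shaping: task reward minus a linear penalty for
    exceeding the target reading grade level."""
    return float(correct) - penalty * max(0.0, grade_level - target_grade)
```

Under this shaping, a correct but hard-to-read answer earns less reward than a correct, plain-language one, so the group-relative advantages push the policy toward simpler phrasing.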