🤖 AI Summary
This study addresses the poor readability of large language model (LLM) responses in public health question answering for non-expert users. To this end, we introduce RephQA, the first domain-specific readability benchmark for public health. Methodologically, we propose a dual-dimensional readability evaluation framework that integrates the Flesch-Kincaid Grade Level with expert-rated professional scores, coupled with an information-preserving proxy multiple-choice task, to systematically expose the misalignment between reasoning capability and expressive clarity. We further develop token-adapted Group Relative Policy Optimization (GRPO) and associated prompting and fine-tuning strategies to enhance readability. Extensive experiments across 25 mainstream LLMs show that most fail to meet the baseline readability standard of writing at or below the U.S. 8th-grade reading level. In contrast, token-adapted GRPO significantly improves readability, yielding more concise, accurate, and user-friendly responses and helping bridge the comprehension gap in public health knowledge dissemination.
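The Flesch-Kincaid Grade Level used above is a standard formula: `0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59`, where the result approximates the U.S. school grade needed to understand the text. A minimal sketch of the metric follows; the syllable counter is a crude vowel-group heuristic for illustration, not the benchmark's actual implementation:

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count contiguous vowel groups, at least 1 per word."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """FKGL = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / max(1, len(sentences))
        + 11.8 * syllables / max(1, len(words))
        - 15.59
    )
```

Short, common words in short sentences score at an early grade level, while polysyllabic jargon pushes the score far above the 8th-grade target.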
📝 Abstract
Large Language Models (LLMs) hold promise for addressing complex medical problems. However, while most prior studies focus on improving accuracy and reasoning, a significant bottleneck in developing effective healthcare agents lies in the readability of LLM-generated responses: their ability to answer public health questions clearly and simply for people without medical backgrounds. In this work, we introduce RephQA, a benchmark for evaluating the readability of LLMs in public health question answering (QA). It contains 533 expert-reviewed QA pairs from 27 sources across 13 topics, and includes a proxy multiple-choice task to assess informativeness, along with two readability metrics: Flesch-Kincaid Grade Level and a professional score. Evaluation of 25 LLMs reveals that most fail to meet readability standards, highlighting a gap between reasoning and effective communication. To address this, we explore four readability-enhancing strategies: standard prompting, chain-of-thought prompting, Group Relative Policy Optimization (GRPO), and a token-adapted GRPO variant. Token-adapted GRPO achieves the best results, representing a step toward more practical, user-friendly agents for public health.
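The core of GRPO is to replace a learned value baseline with group-relative normalization: several responses are sampled per question, and each response's advantage is its reward standardized against the group. The sketch below shows that core step plus a hypothetical readability-shaped reward that penalizes text above a target grade level; the paper's actual token-adapted reward and penalty weights are not specified here and may differ:

```python
import statistics


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO core: standardize each sampled response's reward against its
    own group (mean/std), with no learned value function."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mu) / sigma for r in rewards]


def readability_reward(
    correct: bool,
    grade_level: float,
    target_grade: float = 8.0,  # U.S. 8th-grade readability target
    penalty: float = 0.1,       # hypothetical penalty weight
) -> float:
    """Hypothetical reward shaping: task reward minus a linear penalty for
    exceeding the target reading grade level."""
    return float(correct) - penalty * max(0.0, grade_level - target_grade)
```

Under this shaping, a correct but hard-to-read answer earns less reward than a correct, plain-language one, so the group-relative advantages push the policy toward simpler phrasing.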