🤖 AI Summary
This study systematically evaluates the accuracy, safety, and accessibility of large language models (LLMs) in health communication for breast and cervical cancer. To address the lack of rigorous, multidimensional assessment frameworks, we propose the first hybrid evaluation methodology integrating quantitative metrics, qualitative clinical expert scoring, and robust statistical tests—including Welch’s ANOVA, Games-Howell post-hoc analysis, and Hedges’ *g* effect size estimation. Results reveal that general-purpose LLMs outperform domain-specific models in linguistic quality and engagement, whereas medical LLMs—though significantly improving accessibility through terminology simplification and structural clarity—exhibit systematic increases in toxicity, bias, and potential clinical risk. This trade-off exposes a fundamental tension between domain-knowledge integration and safety constraints. Our work establishes a methodological benchmark for responsible deployment of medical LLMs and provides empirically grounded warnings for clinical AI development.
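For reference, the choice of Hedges' *g* over plain Cohen's *d* comes down to a small-sample bias correction, which matters when groups of expert-rated outputs are modest in size. A standard textbook form (notation ours, not taken from the paper) is:

$$
g = J\,\frac{\bar{x}_1 - \bar{x}_2}{s_p},\qquad
s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}},\qquad
J \approx 1 - \frac{3}{4(n_1+n_2)-9}
$$

where $s_p$ is the pooled standard deviation and $J$ is the correction factor that shrinks the estimate toward zero for small $n_1 + n_2$.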
📝 Abstract
Effective communication about breast and cervical cancers remains a persistent health challenge: significant gaps in public understanding of cancer prevention, screening, and treatment can lead to delayed diagnoses and inadequate care. This study evaluates the capabilities and limitations of Large Language Models (LLMs) in generating accurate, safe, and accessible cancer-related information to support patient understanding. We evaluated five general-purpose and three medical LLMs using a mixed-methods framework covering three dimensions: linguistic quality, safety and trustworthiness, and communication accessibility and affectiveness. Our approach combined quantitative metrics with qualitative expert ratings, analyzed statistically using Welch's ANOVA, Games-Howell post-hoc tests, and Hedges' g effect sizes. General-purpose LLMs produced outputs of higher linguistic quality and affectiveness, while medical LLMs demonstrated greater communication accessibility. However, medical LLMs also exhibited higher levels of potential harm, toxicity, and bias, lowering their safety and trustworthiness scores. These findings point to a tension between domain-specific knowledge and safety in health communication and underscore the need for deliberate model design with targeted improvements, particularly in mitigating harm and bias and in improving safety and affectiveness. This study provides a comprehensive evaluation of LLMs for cancer communication, offering critical insights for improving AI-generated health content and informing the development of accurate, safe, and accessible digital health tools.
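As a concrete illustration of the statistical pipeline named above (a minimal sketch, not the authors' actual code), all three tests map directly onto the `pingouin` library; the DataFrame columns `model` and `score` and the toy values are placeholders:

```python
# Sketch of the reported pipeline: Welch's ANOVA (robust to unequal group
# variances), Games-Howell post-hoc pairwise comparisons, and Hedges' g
# effect sizes. Column names and data are illustrative only.
import pandas as pd
import pingouin as pg

# Toy data: one quality score per generated response, grouped by model.
df = pd.DataFrame({
    "model": ["general_llm"] * 5 + ["medical_llm"] * 5,
    "score": [4.2, 4.5, 4.1, 4.4, 4.3, 3.1, 3.4, 3.0, 3.3, 3.2],
})

# Welch's ANOVA: tests for any group difference without assuming equal variances.
print(pg.welch_anova(data=df, dv="score", between="model"))

# Games-Howell: unequal-variance post-hoc pairwise tests; pingouin also
# reports Hedges' g for each pair in the "hedges" column of the output.
print(pg.pairwise_gameshowell(data=df, dv="score", between="model"))

# Standalone Hedges' g (bias-corrected Cohen's d) between two groups.
a = df.loc[df.model == "general_llm", "score"]
b = df.loc[df.model == "medical_llm", "score"]
print(pg.compute_effsize(a, b, eftype="hedges"))
```

With only two groups, Welch's ANOVA reduces to a Welch t-test; the study compares eight models, so the post-hoc step is what handles the many pairwise comparisons.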