CHBench: A Chinese Dataset for Evaluating Health in Large Language Models

📅 2024-09-24
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This study addresses the critical gaps in safety and factual accuracy of large language models (LLMs) in health consultation. To this end, we introduce CHBench—the first safety-oriented Chinese health evaluation benchmark—comprising 9,492 high-quality, human-annotated question-answer pairs spanning physiological and psychological health domains. We propose a multidimensional, risk-aware evaluation framework that systematically assesses LLMs across key dimensions: factual correctness, clinical guideline compliance, and potential harm identification. Empirical evaluation of four leading Chinese LLMs reveals pervasive failure patterns, including factual inaccuracies, inappropriate medical recommendations, and safety vulnerabilities. CHBench fills a longstanding void in the safety assessment of Chinese medical LLMs and establishes a standardized, quantitatively grounded benchmark to support the development, evaluation, and improvement of trustworthy AI for healthcare applications.

📝 Abstract
With the rapid development of large language models (LLMs), assessing their performance on health-related inquiries has become increasingly essential. Because these models are used in real-world contexts, where misinformation can lead to serious consequences for individuals seeking medical advice and support, a rigorous focus on safety and trustworthiness is necessary. In this work, we introduce CHBench, the first comprehensive safety-oriented Chinese health-related benchmark, designed to evaluate LLMs' ability to understand and address physical and mental health issues from a safety perspective across diverse scenarios. CHBench comprises 6,493 entries on mental health and 2,999 entries on physical health, spanning a wide range of topics. Our extensive evaluation of four popular Chinese LLMs highlights significant gaps in their capacity to deliver safe and accurate health information, underscoring the urgent need for further advancements in this critical domain. The code is available at https://github.com/TracyGuo2001/CHBench.
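The abstract describes CHBench as a collection of entries split across two health domains (6,493 mental, 2,999 physical). The actual file schema is defined in the repository and is not given here; as a minimal illustration, assuming each entry is a record carrying a `domain` label alongside its question text, a per-domain tally over such records might look like:

```python
from collections import Counter

# Hypothetical CHBench-style records; the real schema lives in the
# repository (https://github.com/TracyGuo2001/CHBench) and may differ.
entries = [
    {"domain": "mental", "question": "What can I do about chronic insomnia?"},
    {"domain": "physical", "question": "Should I see a doctor for a 38°C fever?"},
    {"domain": "mental", "question": "How can I cope with exam anxiety?"},
]

def tally_by_domain(items):
    """Count entries per health domain (e.g. mental vs. physical)."""
    return Counter(e["domain"] for e in items)

counts = tally_by_domain(entries)
print(counts)  # Counter({'mental': 2, 'physical': 1})
```

On the full benchmark, the same tally would recover the 6,493 / 2,999 split reported in the abstract.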
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' health-related performance in Chinese.
Assessing safety and accuracy in health information delivery.
Creating a benchmark for mental and physical health issues.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chinese health benchmark
safety-oriented evaluation
large language models
Chenlu Guo
School of Artificial Intelligence, Jilin University, China
Nuo Xu
School of Artificial Intelligence, Jilin University, China
Yi Chang
School of Artificial Intelligence, Jilin University, China; Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE, China; International Center of Future Science, Jilin University, China
Yuan Wu
School of Artificial Intelligence, Jilin University, China