🤖 AI Summary
Large language models (LLMs) exhibit significant factual inaccuracies and safety risks when delivering harm reduction health information to people who use drugs (PWUD), posing critical challenges for public health applications. Method: We introduce HRIPBench—the first domain-specific evaluation benchmark for PWUD health needs—comprising 2,160 question-answer-evidence pairs across three tasks: safety boundary judgment, quantitative parameter generation, and multi-substance interaction risk inference. We propose a framework combining instruction tuning with retrieval-augmented generation (RAG), grounded in authoritative medical knowledge sources, to assess model behaviour with and without external domain knowledge. Results: State-of-the-art LLMs make substantial factual errors and pose elevated safety risks in critical harm reduction domains, underscoring the need for domain-adaptive optimization. This work establishes a methodological foundation and empirical evidence for the trustworthy deployment of LLMs in sensitive public health contexts.
📝 Abstract
The well-being of millions of individuals is challenged by the harms of substance use. Harm reduction, a public health strategy, is designed to improve their health outcomes and reduce safety risks. Some large language models (LLMs) have demonstrated a decent level of medical knowledge, promising to address the information needs of people who use drugs (PWUD). However, their performance in relevant tasks remains largely unexplored. We introduce HRIPBench, a benchmark designed to evaluate LLMs' accuracy and safety risks in harm reduction information provision. The benchmark dataset, HRIP-Basic, has 2,160 question-answer-evidence pairs. The scope covers three tasks: checking safety boundaries, providing quantitative values, and inferring polysubstance use risks. We build the Instruction and RAG schemes to evaluate model behaviours based on their inherent knowledge and the integration of domain knowledge. Our results indicate that state-of-the-art LLMs still struggle to provide accurate harm reduction information, and sometimes pose severe safety risks to PWUD. The use of LLMs in harm reduction contexts should be cautiously constrained to avoid inducing negative health outcomes. WARNING: This paper contains illicit content that potentially induces harms.