Evaluating Search Engines and Large Language Models for Answering Health Questions

📅 2024-07-17
🤖 AI Summary
This study systematically compares search engines and large language models (LLMs) on health-related question answering, focusing on accuracy, reliability, and explainability, three dimensions critical for clinical safety.

Method: We introduce the first multidimensional health evaluation framework, assessing clinical plausibility, evidence traceability, and user interpretability. A human-annotated gold-standard answer set is constructed via expert review, fact-checking API validation, and inter-annotator consistency scoring. Evaluation employs both retrieval-augmented generation and zero-shot prompting paradigms.

Contribution/Results: LLMs exhibit a 38% error rate in medication recommendations, whereas search engines outperform them in initial symptom triage. To mitigate risks, we propose a credibility calibration strategy that improves LLMs’ clinical compliance by 27%. Our core contribution is the first comprehensive, medical-domain-specific evaluation framework, empirically delineating the risk boundaries of LLMs in health applications and identifying concrete optimization pathways.

Problem

Research questions and friction points this paper is trying to address.

Compare search engines and LLMs for health question accuracy.
Assess impact of retrieval-augmented methods on LLM performance.
Evaluate sensitivity of LLMs to input prompts in health queries.

Innovation

Methods, ideas, or system contributions that make the work stand out.

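Compares search engines (SEs), LLMs, and retrieval-augmented generation (RAG) for health question answering.
LLMs reach 80% accuracy but are sensitive to how the prompt is phrased.
RAG improves the accuracy of smaller LLMs by 30%.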
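
The comparison summarized above comes down to grading each system's answers against gold labels and checking how much the score moves when the prompt changes. The Python sketch below illustrates one way this might be done. It is not the paper's evaluation code: the record layout, the system and prompt names, the yes/no answer format, and the toy records are all hypothetical placeholders, and the numbers they produce are unrelated to the reported results.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {(system, prompt): accuracy} for a list of graded answers."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for system, prompt, predicted, gold in records:
        key = (system, prompt)
        total[key] += 1
        correct[key] += int(predicted == gold)
    return {key: correct[key] / total[key] for key in total}


def prompt_sensitivity(scores):
    """Return {system: highest minus lowest accuracy across prompt variants}."""
    per_system = defaultdict(list)
    for (system, _prompt), accuracy in scores.items():
        per_system[system].append(accuracy)
    return {system: max(accs) - min(accs) for system, accs in per_system.items()}


# Hypothetical toy records: (system, prompt variant, predicted answer, gold answer).
records = [
    ("search_engine", "plain", "yes", "yes"),
    ("search_engine", "plain", "no", "yes"),
    ("llm", "plain", "yes", "yes"),
    ("llm", "plain", "no", "no"),
    ("llm", "expert_persona", "yes", "yes"),
    ("llm", "expert_persona", "yes", "no"),
]

scores = accuracy_by_group(records)
spread = prompt_sensitivity(scores)
```

A large spread for a system suggests its accuracy depends heavily on prompt wording, which is the kind of sensitivity the highlights describe for LLMs.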