🤖 AI Summary
This study systematically compares search engines and large language models (LLMs) on health-related question answering, focusing on accuracy, reliability, and explainability, the dimensions most critical for clinical safety.

Method: We introduce the first multidimensional health evaluation framework, assessing clinical plausibility, evidence traceability, and user interpretability. A human-annotated gold-standard answer set is constructed through expert review, fact-checking API validation, and inter-annotator consistency scoring. Evaluation covers both retrieval-augmented generation (RAG) and zero-shot prompting paradigms.

Contribution/Results: LLMs exhibit a 38% error rate in medication recommendations, while search engines hold the advantage in initial symptom triage. To mitigate these risks, we propose a credibility calibration strategy that improves LLMs' clinical compliance by 27%. Our core contribution is the comprehensive, medical-domain-specific evaluation framework itself, which empirically delineates the risk boundaries of LLMs in health applications and identifies concrete optimization pathways.
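The inter-annotator consistency scoring used to validate the gold-standard answers is not specified further; a common choice for such agreement checks is Cohen's kappa. The sketch below, with hypothetical expert labels, shows the computation under that assumption.

```python
# Illustrative sketch: pairwise inter-annotator agreement via Cohen's kappa.
# The paper does not name its consistency metric; kappa is assumed here.
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators label identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the two annotators labeled independently.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical labels: two experts grading answers as correct/partial/wrong.
expert_1 = ["correct", "correct", "wrong", "partial", "correct", "wrong"]
expert_2 = ["correct", "partial", "wrong", "partial", "correct", "correct"]
print(f"kappa = {cohen_kappa(expert_1, expert_2):.2f}")  # ~0.48
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label (e.g., "correct") dominates the answer set.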
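The two evaluation paradigms differ in whether retrieved evidence accompanies the question. The paper does not publish its prompts, so the wording, evidence format, and example question below are illustrative assumptions only.

```python
# Hedged sketch of the two paradigms; prompt templates are assumptions,
# not the study's actual prompts.

def zero_shot_prompt(question: str) -> str:
    # Zero-shot: the model answers from parametric knowledge alone.
    return f"Answer the following health question concisely.\nQ: {question}\nA:"

def rag_prompt(question: str, passages: list[str]) -> str:
    # RAG: retrieved evidence is prepended so answers stay traceable to sources.
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the evidence below and cite passage numbers.\n"
        f"Evidence:\n{context}\nQ: {question}\nA:"
    )

question = "Can ibuprofen be taken with warfarin?"
passages = ["NSAIDs such as ibuprofen increase bleeding risk with warfarin."]
print(zero_shot_prompt(question))
print(rag_prompt(question, passages))
```

Contrasting the same question under both templates makes the evidence-traceability dimension concrete: only the RAG variant can be scored on whether cited passages actually support the answer.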