Two-layer retrieval augmented generation framework for low-resource medical question-answering using Reddit data: Proof of concept (Preprint)

📅 2024-05-29
🏛️ Journal of Medical Internet Research
📈 Citations: 0
Influential: 0
🤖 AI Summary
Clinical practitioners need timely, evidence-informed answers to questions about emerging substances (e.g., xylazine, ketamine), yet it is difficult to distill actionable clinical insights from large volumes of noisy, unstructured social media text. Method: The authors propose a two-tier retrieval-augmented generation (RAG) framework that processes large-scale user-generated content from Reddit. The first tier generates fine-grained, post-level summaries; the second tier synthesizes these into a single aggregated, clinically oriented summary that answers the query. A quantized lightweight model, Nous-Hermes-2-7B-DPO, is used to enable efficient, low-resource deployment. Contribution/Results: The framework introduces a dual-layer summarization mechanism designed to balance semantic fidelity and clinical utility. Evaluation on 20 clinical questions with 76 samples shows no statistically significant difference (p > 0.05) versus GPT-4 in relevance, coverage, coherence, length, and hallucination; a statistically significant difference was observed only for the Coleman-Liau Index. The results suggest the framework can be deployed in resource-constrained clinical settings.
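The two tiers described in the summary can be sketched as follows. This is a minimal illustration, not the authors' implementation: the word-overlap retriever, the prompt wording, and the names `tokenize`, `retrieve`, `two_layer_answer`, and `llm` are all assumptions introduced here. `llm` stands for any function that maps a prompt string to a completion string, for example a wrapper around a locally hosted quantized model.

```python
import re
from typing import Callable, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase word set used by the toy retriever (an assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, posts: List[str], k: int = 3) -> List[str]:
    """Return up to k posts that share at least one word with the query.

    Word overlap is only a placeholder; the source does not describe
    the retriever actually used.
    """
    q = tokenize(query)
    scored = [(len(q & tokenize(p)), i, p) for i, p in enumerate(posts)]
    scored.sort(key=lambda t: (-t[0], t[1]))  # best overlap first, original order on ties
    return [p for score, _, p in scored[:k] if score > 0]


def two_layer_answer(
    query: str, posts: List[str], llm: Callable[[str], str], k: int = 3
) -> str:
    """Layer 1 summarizes each retrieved post; layer 2 aggregates the summaries."""
    relevant = retrieve(query, posts, k)
    if not relevant:
        return "No relevant posts found."

    # Layer 1: one query-focused summary per retrieved post.
    summaries = [
        llm(
            f"Question: {query}\nPost: {post}\n"
            "Summarize only the parts of this post that help answer the question."
        )
        for post in relevant
    ]

    # Layer 2: a single aggregated answer built from the post-level summaries.
    bullet_list = "\n".join(f"- {s}" for s in summaries)
    return llm(
        f"Question: {query}\nPost summaries:\n{bullet_list}\n"
        "Combine these summaries into one answer to the question."
    )
```

Keeping `llm` as a plain callable means the same pipeline can be run with either of the two models being compared, without changing the retrieval or aggregation steps.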

📝 Abstract
The increasing use of social media to share lived and living experiences of substance use presents a unique opportunity to obtain information on side effects, use patterns, and opinions on novel psychoactive substances. However, because of the large volume of data, obtaining useful insights with natural language processing technologies such as large language models is challenging. This paper aims to develop a retrieval-augmented generation (RAG) architecture for medical question answering that addresses clinicians' queries on emerging health-related issues using user-generated medical information from social media. We propose a two-layer RAG framework for query-focused answer generation and evaluate a proof of concept in the context of query-focused summary generation from social media forums, focusing on emerging drug-related information. The modular framework first generates individual summaries and then an aggregated summary, so that medical queries can be answered efficiently from large amounts of user-generated social media data. We compared the performance of a quantized large language model (Nous-Hermes-2-7B-DPO), deployable in low-resource settings, with that of GPT-4. For this proof-of-concept study, we used user-generated data from Reddit to answer clinicians' questions on the use of xylazine and ketamine. Across 20 queries with 76 samples, the framework achieved comparable median scores for relevance, length, hallucination, coverage, and coherence with GPT-4 and with Nous-Hermes-2-7B-DPO. There was no statistically significant difference between the two models for coverage, coherence, relevance, length, or hallucination. A statistically significant difference was noted for the Coleman-Liau Index. Our RAG framework can effectively answer medical questions about targeted topics and can be deployed in resource-constrained settings.
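The Coleman-Liau Index mentioned in the abstract is a readability score computed as CLI = 0.0588 L - 0.296 S - 15.8, where L is the average number of letters per 100 words and S is the average number of sentences per 100 words. The sketch below applies that formula; the tokenization rules (what counts as a word or a sentence) and the function name `coleman_liau_index` are assumptions made here, since the abstract does not say how the score was computed.

```python
import re


def coleman_liau_index(text: str) -> float:
    """Coleman-Liau Index: 0.0588 * L - 0.296 * S - 15.8.

    L = letters per 100 words, S = sentences per 100 words.
    Word and sentence splitting below are simple assumptions.
    """
    # A word is a run of letters, optionally with one internal apostrophe part.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not words:
        raise ValueError("text contains no words")

    letters = sum(ch.isalpha() for word in words for ch in word)
    # A sentence ends at a run of '.', '!' or '?'; count at least one sentence.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))

    letters_per_100 = letters / len(words) * 100
    sentences_per_100 = sentences / len(words) * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8
```

Longer words and longer sentences raise the score, so a difference on this index between two models' summaries reflects a difference in surface readability rather than in content quality.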
Problem

Research questions and friction points this paper is trying to address.

Social Media Mining
Health Information Retrieval
Computational Resource Limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-step Enhancement Method
Drug Information Retrieval
Social Media Utilization
Sudeshna Das
Associate Prof. of Neurology, Harvard Medical School
Bioinformatics
Yao Ge
National Institutes of Health (NIH)
Natural Language Processing, Information Extraction, Biomedical Informatics
Yuting Guo
Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, USA
Swati Rajwal
Department of Computer Science and Informatics, Emory University, Atlanta, GA, USA
JaMor M. Hairston
Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, USA
Jeanne Powell
Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, USA
Drew Walker
Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, USA
Snigdha Peddireddy
Department of Behavioral, Social, & Health Education Sciences, Rollins School of Public Health, Emory University, Atlanta, GA, USA
S. Lakamana
Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, USA
Selen Bozkurt
Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, USA
Matthew Reyna
Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, USA
Reza Sameni
Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, USA
Yunyu Xiao
Weill Cornell Medicine | NewYork-Presbyterian. Department of Population Health Sciences | Health
Suicide, Mental Health, Health Disparities, Health Data Science
Sangmi Kim
Nell Hodgson Woodruff School of Nursing, Emory University, Atlanta, GA, USA
Rasheeta D. Chandler
Nell Hodgson Woodruff School of Nursing, Emory University, Atlanta, GA, USA
Natalie Hernandez
Center for Maternal Health Equity, Morehouse School of Medicine, Atlanta, GA, USA
Danielle Mowery
Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, PA, USA
R. Wightman
Department of Emergency Medicine, Warren Alpert Medical School of Brown University, Providence, RI, USA
Jennifer Love
Department of Emergency Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
A. Spadaro
Department of Emergency Medicine, Rutgers New Jersey Medical School, Newark, NJ, USA
Jeanmarie Perrone
Department of Emergency Medicine, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA, USA
Abeed Sarker
Emory University School of Medicine
Natural Language Processing, Biomedical Informatics, Health Data Science, Applied Machine Learning