PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data

📅 2026-02-24
🤖 AI Summary
This study addresses the limitations of existing NLP approaches to capturing the patient voice (PV) in patient-generated text: they often treat patient-centered communication (PCC) and social determinants of health (SDoH) as separate tasks and rely on models poorly suited to patient-authored language. To bridge this gap, the authors propose PVminer, a framework that integrates author identity and unsupervised topic modeling into PV detection, formulated as a multi-label, multi-class classification task. The approach uses domain-adapted encoders (PV-BERT-base and PV-BERT-large) and a PV-Topic-BERT variant that incorporates topic representations during both fine-tuning and inference to enrich the semantic input. On the Code, Subcode, and Combo tasks, the models achieve F1 scores of 82.25%, 80.14%, and up to 77.87%, respectively, outperforming biomedical and clinical pre-trained baselines. An ablation study indicates that author identity and topic-based augmentation each contribute to performance.
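
As a rough illustration of what topic-augmented input might look like, the following Python sketch prepends topic labels to a message before it would be passed to an encoder, using the same function at fine-tuning and inference time. Everything here is an assumption made for illustration: the summary does not say how PV-Topic-BERT fuses topic representations, the keyword lookup is a toy stand-in for an unsupervised topic model, and `TOPIC_KEYWORDS`, `top_topics`, and `augment` are hypothetical names.

```python
from collections import Counter

# Hypothetical topic vocabulary (illustrative only; a real system would
# learn topics from data with an unsupervised topic model).
TOPIC_KEYWORDS = {
    "medication": {"refill", "dose", "prescription", "pharmacy"},
    "scheduling": {"appointment", "reschedule", "visit", "cancel"},
    "cost": {"insurance", "bill", "copay", "afford"},
}


def top_topics(text, k=2):
    """Rank topics by keyword overlap with the message (toy stand-in)."""
    tokens = [t.strip(".,!?;:").lower() for t in text.split()]
    counts = Counter()
    for topic, keywords in TOPIC_KEYWORDS.items():
        hits = sum(1 for t in tokens if t in keywords)
        if hits:
            counts[topic] = hits
    return [topic for topic, _ in counts.most_common(k)]


def augment(text, k=2, sep=" [SEP] "):
    """Prepend topic labels to the message; apply identically at
    fine-tuning and inference so the encoder sees the same input format."""
    topics = top_topics(text, k)
    prefix = " ".join(topics) if topics else "none"
    return prefix + sep + text


if __name__ == "__main__":
    print(augment("Please reschedule my appointment, I need a refill."))
```

Prepending topic text is only one possible design; concatenating a topic vector with the encoder's pooled output would be another, and the summary does not indicate which (if either) the authors use.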

📝 Abstract
Patient-generated text such as secure messages, surveys, and interviews contains rich expressions of the patient voice (PV), reflecting communicative behaviors and social determinants of health (SDoH). Traditional qualitative coding frameworks are labor intensive and do not scale to large volumes of patient-authored messages across health systems. Existing machine learning (ML) and natural language processing (NLP) approaches provide partial solutions but often treat patient-centered communication (PCC) and SDoH as separate tasks or rely on models not well suited to patient-facing language. We introduce PVminer, a domain-adapted NLP framework for structuring patient voice in secure patient-provider communication. PVminer formulates PV detection as a multi-label, multi-class prediction task integrating patient-specific BERT encoders (PV-BERT-base and PV-BERT-large), unsupervised topic modeling for thematic augmentation (PV-Topic-BERT), and fine-tuned classifiers for Code, Subcode, and Combo-level labels. Topic representations are incorporated during fine-tuning and inference to enrich semantic inputs. PVminer achieves strong performance across hierarchical tasks and outperforms biomedical and clinical pre-trained baselines, achieving F1 scores of 82.25% (Code), 80.14% (Subcode), and up to 77.87% (Combo). An ablation study further shows that author identity and topic-based augmentation each contribute meaningful gains. Pre-trained models, source code, and documentation will be publicly released, with annotated datasets available upon request for research use.
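
To make the multi-label formulation concrete, the sketch below turns per-label probabilities into a label set by thresholding and scores predictions with micro-averaged F1. This is a minimal illustration, not PVminer's code: the label names, the 0.5 threshold, and the choice of micro averaging are assumptions, since the abstract does not specify them.

```python
# Hypothetical label inventory for illustration; the actual Code-level
# labels are not listed in the abstract.
LABELS = ["InformationSeeking", "EmotionalExpression", "SDoH"]


def decode(probs, labels=LABELS, threshold=0.5):
    """Multi-label decoding: keep every label whose probability clears
    the threshold, so a message can receive zero, one, or many labels."""
    return {label for label, p in zip(labels, probs) if p >= threshold}


def micro_f1(gold, pred):
    """Micro-averaged F1 over a list of gold and predicted label sets."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # labels correctly predicted
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # predicted but not gold
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # gold but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    gold = [{"InformationSeeking", "SDoH"}, {"EmotionalExpression"}]
    pred = [decode([0.9, 0.2, 0.4]), decode([0.1, 0.8, 0.7])]
    print(pred, round(micro_f1(gold, pred), 4))
```

Micro averaging pools true and false positives across all labels, so frequent labels dominate the score; macro averaging would weight each label equally instead.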
Problem

Research questions and friction points this paper is trying to address.

Patient Voice
Patient-Generated Data
Social Determinants of Health
Patient-Centered Communication
Natural Language Processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

domain-adapted NLP
patient voice detection
PV-BERT
topic-augmented classification
multi-label hierarchical prediction
Samah Fodeh
Department of Emergency Medicine, Yale School of Medicine, 464 Congress Ave, 06519, CT, USA; Department of Biomedical Informatics & Data Science, Yale School of Medicine, 100 College Street, 06510, CT, USA
Linhai Ma
Yale University
Deep learning, Medical signal/image analysis, Concurrency
Yan Wang
Yale University
Natural Language Processing, Information Extraction, Text Mining
Srivani Talakokkul
Department of Emergency Medicine, Yale School of Medicine, 464 Congress Ave, 06519, CT, USA
Ganesh Puthiaraju
Department of Emergency Medicine, Yale School of Medicine, 464 Congress Ave, 06519, CT, USA
Afshan Khan
Department of Emergency Medicine, Yale School of Medicine, 464 Congress Ave, 06519, CT, USA
Ashley Hagaman
Department of Social and Behavioral Sciences, Yale School of Public Health, 60 College Street, 06520, CT, USA
Sarah Lowe
Department of Social and Behavioral Sciences, Yale School of Public Health, 60 College Street, 06520, CT, USA
Aimee Roundtree
Division of Research, Texas State University, 601 University Dr., 78666, TX, USA