Determinants of Training Corpus Size for Clinical Text Classification

📅 2026-01-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the often-arbitrary selection of training corpus size in clinical text classification, a choice typically driven by annotation costs rather than by the lexical characteristics of the text. Leveraging the MIMIC-III dataset, the authors systematically evaluate model performance on ten diagnostic tasks, using BERT embeddings with a random forest classifier under varying training set sizes. To quantify the impact of vocabulary on learning curves, they employ Lasso logistic regression on bag-of-words features to identify strongly predictive words and noise words. Their findings reveal that approximately 600 documents suffice to achieve 95% of the performance attainable with 10,000 documents; moreover, every additional 100 strongly predictive words increases peak accuracy by about 0.04, whereas every additional 100 noise words reduces it by roughly 0.02. This work provides empirical evidence and quantitative guidance for sample-size planning in clinical NLP tasks.
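The experimental setup in the summary can be sketched as follows. This is a minimal, hedged illustration: synthetic data and generic numeric features stand in for MIMIC-III notes and BERT embeddings (neither is reproduced here), and the paper's 95%-of-peak saturation criterion is applied to the resulting learning curve.

```python
# Hedged sketch of the learning-curve experiment. Synthetic stand-ins:
# make_classification replaces MIMIC-III notes, and plain numeric features
# replace 768-dim BERT embeddings.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic "documents": a pool of 10,000 for training plus a held-out test set.
X, y = make_classification(n_samples=11000, n_features=50, n_informative=10,
                           random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=1000,
                                                  random_state=0)

# Train on increasing corpus sizes and record test accuracy (a learning curve).
curve = {}
for n in [100, 300, 600, 1000, 3000, 10000]:
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_pool[:n], y_pool[:n])
    curve[n] = accuracy_score(y_test, clf.predict(X_test))

# Saturation criterion from the paper: the smallest corpus size reaching
# 95% of the accuracy obtained with the full 10,000-document corpus.
threshold = 0.95 * curve[10000]
n_sufficient = min(n for n, acc in curve.items() if acc >= threshold)
print(curve)
print("documents needed for 95% of peak accuracy:", n_sufficient)
```

On real clinical notes the shape of this curve, and hence the sufficient corpus size, would depend on the vocabulary properties the paper analyses.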

📝 Abstract
Introduction: Clinical text classification with natural language processing (NLP) models requires adequate training data to achieve optimal performance. In practice, 200-500 documents are typically annotated, a number constrained by time and cost rather than justified by sample size requirements or their relationship to text vocabulary properties. Methods: Using the publicly available MIMIC-III dataset of hospital discharge notes with ICD-9 diagnoses as labels, we employed pre-trained BERT embeddings followed by Random Forest classifiers to identify 10 randomly selected diagnoses, varied training corpus sizes from 100 to 10,000 documents, and analyzed vocabulary properties by identifying strong and noisy predictive words through Lasso logistic regression on bag-of-words representations. Results: Learning curves varied significantly across the 10 classification tasks despite identical preprocessing and algorithms, with 600 documents sufficient for all tasks to achieve 95% of the performance attainable with 10,000 documents. Vocabulary analysis revealed that more strong predictors and fewer noisy predictors were associated with steeper learning curves: every 100 additional noisy words decreased accuracy by approximately 0.02, while every 100 additional strong predictors increased maximum accuracy by approximately 0.04.
Problem

Research questions and friction points this paper is trying to address.

training corpus size
clinical text classification
sample size justification
vocabulary properties
NLP
Innovation

Methods, ideas, or system contributions that make the work stand out.

training corpus size
clinical text classification
vocabulary analysis
learning curves
predictive words
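As a back-of-envelope use of the reported effect sizes (about +0.04 peak accuracy per 100 additional strong predictive words and -0.02 per 100 additional noise words), the vocabulary's net effect could be sketched as below; the function name and the linear extrapolation are illustrative assumptions, not the paper's model.

```python
# Illustrative linear extrapolation of the paper's reported effect sizes.
def expected_accuracy_shift(extra_strong_words: int, extra_noise_words: int) -> float:
    """Approximate change in peak accuracy from vocabulary changes:
    +0.04 per 100 strong predictive words, -0.02 per 100 noise words."""
    return 0.04 * (extra_strong_words / 100) - 0.02 * (extra_noise_words / 100)

# e.g. 200 extra strong predictors and 100 extra noise words:
print(round(expected_accuracy_shift(200, 100), 4))  # 0.06
```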
Jaya Chaturvedi
King’s College London, United Kingdom
Saniya Deshpande
King’s College London, United Kingdom
Chenkai Ma
King’s College London, United Kingdom
Robert Cobb
King’s College London, United Kingdom
Angus Roberts
King's College London
Natural Language Processing
Robert Stewart
Professor of Psychiatric Epidemiology and Clinical Informatics, King's College London
Psychiatric Epidemiology, Clinical Informatics, Old Age Psychiatry, International Mental Health
Daniel Stahl
Department of Biostatistics and Health Informatics, IoPPN, King's College London
Statistics, Machine/Statistical Learning, Prediction Modeling, Causal Modelling, Clinical Trials
Diana Shamsutdinova
King’s College London, United Kingdom