KDH-MLTC: Knowledge Distillation for Healthcare Multi-Label Text Classification

📅 2025-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To balance accuracy and efficiency in medical multi-label text classification (MLTC) while handling fine-grained medical terminology and strict HIPAA compliance requirements, this paper proposes a three-stage lightweight framework: (1) knowledge distillation from BERT to DistilBERT, (2) sequential fine-tuning, and (3) particle swarm optimization (PSO) for hyperparameter search. It is the first work to achieve semantically faithful knowledge transfer from large-scale models to lightweight counterparts specifically for medical MLTC, while enabling on-premises deployment to preserve data privacy. Evaluated on the Hallmarks of Cancer (HoC) benchmark datasets, the framework achieves an F1-score of 82.70%, significantly outperforming established baselines. Ablation studies and statistical significance tests confirm its robustness. Moreover, inference latency is reduced by 63%, enabling real-time execution on edge devices.
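The distillation step described above transfers BERT's per-label predictions to DistilBERT. The paper does not publish its exact loss, but a common formulation for multi-label distillation blends a soft-target term (BCE against the teacher's temperature-smoothed sigmoid probabilities) with a hard-label term (BCE against ground truth). A minimal NumPy sketch, assuming this standard blended loss with hypothetical `temperature` and `alpha` settings:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=2.0, alpha=0.5):
    """Multi-label distillation loss: a weighted blend of soft-target BCE
    (against temperature-smoothed teacher probabilities) and hard-label BCE.
    temperature and alpha are illustrative values, not the paper's."""
    eps = 1e-12
    # Soft targets: per-label sigmoid with temperature smoothing
    soft_t = sigmoid(teacher_logits / temperature)
    p_soft = sigmoid(student_logits / temperature)
    soft_loss = -np.mean(soft_t * np.log(p_soft + eps)
                         + (1 - soft_t) * np.log(1 - p_soft + eps))
    # Hard loss: standard BCE against the ground-truth label matrix
    p = sigmoid(student_logits)
    hard_loss = -np.mean(hard_labels * np.log(p + eps)
                         + (1 - hard_labels) * np.log(1 - p + eps))
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

A student whose logits track the teacher's incurs a lower loss than one that contradicts it, which is what drives the transfer during sequential training.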

📝 Abstract
The increasing volume of healthcare textual data requires computationally efficient, yet highly accurate classification approaches able to handle the nuanced and complex nature of medical terminology. This research presents Knowledge Distillation for Healthcare Multi-Label Text Classification (KDH-MLTC), a framework leveraging model compression and Large Language Models (LLMs). The proposed approach addresses conventional healthcare Multi-Label Text Classification (MLTC) challenges by integrating knowledge distillation and sequential fine-tuning, subsequently optimized through Particle Swarm Optimization (PSO) for hyperparameter tuning. KDH-MLTC transfers knowledge from a more complex teacher LLM (i.e., BERT) to a lighter student LLM (i.e., DistilBERT) through sequential training adapted to MLTC, preserving the teacher's learned information while significantly reducing computational requirements. As a result, classification can be conducted locally, making the framework suitable for sensitive healthcare textual data and thereby ensuring HIPAA compliance. Experiments conducted on three medical literature datasets of different sizes, sampled from the Hallmarks of Cancer (HoC) dataset, demonstrate that KDH-MLTC achieves superior performance compared to existing approaches, particularly on the largest dataset, reaching an F1 score of 82.70%. Additionally, statistical validation and an ablation study confirm the robustness of KDH-MLTC. Furthermore, the PSO-based hyperparameter optimization process identified optimal configurations. The proposed approach contributes to healthcare text classification research by balancing the efficiency requirements of resource-constrained healthcare settings with satisfactory accuracy.
Problem

Research questions and friction points this paper is trying to address.

Efficient multi-label classification for healthcare text data
Knowledge distillation from complex to lightweight language models
HIPAA-compliant local processing of sensitive medical information
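The first friction point above, multi-label classification evaluated by F1 score, reduces at inference time to thresholding per-label sigmoid probabilities and scoring the resulting binary matrix. A minimal sketch of that pipeline (the 0.5 threshold and micro-averaging are common defaults, not details confirmed by the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_labels(logits, threshold=0.5):
    """Turn per-label logits into a binary multi-label prediction matrix."""
    return (sigmoid(logits) >= threshold).astype(int)

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over every (sample, label) decision."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because each label gets its own sigmoid rather than a shared softmax, a document can belong to several Hallmarks of Cancer classes at once, which is what distinguishes MLTC from ordinary multi-class classification.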
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses knowledge distillation for model compression
Integrates sequential fine-tuning with PSO optimization
Transfers BERT knowledge to lighter DistilBERT model
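The PSO stage listed above searches the fine-tuning hyperparameter space. The paper's exact swarm settings are not reproduced here; the sketch below is a generic PSO minimizer over a continuous box, with illustrative inertia and acceleration coefficients, applied to a hypothetical validation-loss objective:

```python
import numpy as np

def pso_search(objective, bounds, n_particles=12, n_iters=40,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal particle swarm optimization: each particle tracks its
    personal best, and all particles are attracted toward the swarm's
    global best. bounds is a list of (low, high) pairs per dimension."""
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)
    lo, hi = bounds[:, 0], bounds[:, 1]
    pos = rng.uniform(lo, hi, size=(n_particles, len(bounds)))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_val = np.array([objective(p) for p in pos])
    gbest = pbest[np.argmin(pbest_val)].copy()
    gbest_val = pbest_val.min()
    for _ in range(n_iters):
        r1 = rng.random(pos.shape)
        r2 = rng.random(pos.shape)
        # Velocity update: inertia + cognitive pull + social pull
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        vals = np.array([objective(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved] = pos[improved]
        pbest_val[improved] = vals[improved]
        if pbest_val.min() < gbest_val:
            gbest_val = pbest_val.min()
            gbest = pbest[np.argmin(pbest_val)].copy()
    return gbest, gbest_val
```

In the KDH-MLTC setting, `objective` would train or partially train the student model at a candidate configuration (e.g., learning rate, batch size) and return a validation loss; each objective call is therefore expensive, which motivates a small swarm.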
Hajar Sakai
Ph.D. in Industrial and Systems Engineering
Large Language Models · Text Classification · Time Series Forecasting
Sarah S. Lam
School of Systems Science and Industrial Engineering, State University of New York at Binghamton