🤖 AI Summary
This study addresses catastrophic forgetting and the lack of standardized benchmarks in continual learning for medical language models. To this end, we introduce MedCL-Bench, the first reproducible, multi-task continual learning evaluation framework tailored to the biomedical domain: it integrates ten datasets spanning five task types and systematically evaluates eleven continual learning strategies across eight task sequences. Our experiments show that naive sequential fine-tuning suffers severe forgetting; parameter isolation achieves the best performance retention per GPU-hour, experience replay offers strong protection at high computational cost, and regularization yields limited benefit. Multi-label classification tasks are the most susceptible to forgetting, whereas constrained-output tasks are more robust. This work uncovers a critical trade-off between stability and efficiency and highlights task-dependent forgetting patterns.
📝 Abstract
Medical language models must be updated as evidence and terminology evolve, yet sequential updating can trigger catastrophic forgetting. Although biomedical NLP has many static benchmarks, no unified, task-diverse benchmark exists for evaluating continual learning with standardized protocols, robustness to task order, and compute-aware reporting. We introduce MedCL-Bench, which streams ten biomedical NLP datasets spanning five task families and evaluates eleven continual learning strategies across eight task orders, reporting retention, transfer, and GPU-hour cost. Across backbones and task orders, direct sequential fine-tuning on incoming tasks induces catastrophic forgetting, with each update causing performance regressions on prior tasks. Continual learning methods occupy distinct retention-compute frontiers: parameter isolation provides the best retention per GPU-hour, replay offers strong protection at higher cost, and regularization yields limited benefit. Forgetting is task-dependent: multi-label topic classification is most vulnerable, while constrained-output tasks are more robust. MedCL-Bench provides a reproducible framework for auditing model updates before deployment.
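The retention and forgetting numbers reported above follow the standard continual learning formulation: evaluate every task after each training step, then measure final accuracy and the drop from each task's best earlier score. A minimal sketch of these metrics (illustrative code, not MedCL-Bench's actual implementation; the accuracy matrix here is toy data):

```python
# acc[i][j] = score on task j after training on tasks 0..i,
# the accuracy matrix commonly used in continual learning evaluation.

def final_accuracy(acc):
    """Mean accuracy over all tasks after the last training step ("retention")."""
    last = acc[-1]
    return sum(last) / len(last)

def average_forgetting(acc):
    """Mean drop from each task's best earlier score to its final score."""
    T = len(acc)
    drops = []
    for j in range(T - 1):  # the last task has no later step to forget
        best = max(acc[i][j] for i in range(T - 1))
        drops.append(best - acc[-1][j])
    return sum(drops) / len(drops)

# Toy 3-task sequence: rows = after training task i, columns = task j.
acc = [
    [0.80, 0.00, 0.00],
    [0.55, 0.85, 0.00],
    [0.50, 0.70, 0.90],
]
print(round(final_accuracy(acc), 3))      # mean of the last row
print(round(average_forgetting(acc), 3))  # mean drop on tasks 0 and 1
```

A "retention per GPU-hour" comparison, as in the abstract, would divide the final-accuracy gain of a method over naive fine-tuning by its measured training cost.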