AI Summary
Clinical timeline extraction from Hebrew electronic health records (EHRs) lacks domain-specific language models, hindering accurate temporal reasoning in clinical NLP. Method: We introduce the first continually pre-trained language model for Hebrew clinical text, built upon DictaBERT 2.0, and propose a novel lexically adaptive tokenization strategy to enhance morphological processing of Hebrew. We construct a de-identified, temporally annotated clinical timeline dataset spanning two domains (internal/emergency medicine and oncology), empirically validating that de-identification preserves downstream performance while ensuring privacy compliance. Contribution/Results: Our model achieves state-of-the-art performance on two newly released Hebrew clinical timeline benchmark datasets. Both the model and dataset are publicly released under ethically reviewed, privacy-preserving protocols, supporting reproducible, compliant research in Hebrew clinical NLP.
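As a hedged illustration of the vocabulary-adaptation step named above (the paper's exact procedure is not detailed here), the sketch below extends a BERT-style tokenizer with in-domain Hebrew terms and resizes the model's embedding matrix before continual pre-training. The checkpoint name and the example tokens are assumptions for illustration, not the released artifacts.

```python
# Minimal sketch of lexical/vocabulary adaptation, assuming HuggingFace
# transformers and a public DictaBERT-family checkpoint (name illustrative).
from transformers import AutoTokenizer, AutoModelForMaskedLM

checkpoint = "dicta-il/dictabert"  # assumption: a DictaBERT-family checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Hypothetical in-domain terms; in practice these would be mined by frequency
# from the de-identified clinical corpus.
new_tokens = ["אשפוז", "כימותרפיה"]  # "hospitalization", "chemotherapy"
tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix so the new token ids get (randomly initialized)
# rows; continual pre-training then learns them jointly with the original
# vocabulary.
model.resize_token_embeddings(len(tokenizer))
```

Adding whole clinical words in this way reduces over-segmentation of morphologically rich Hebrew terms, which is one plausible mechanism behind the token-efficiency gains reported below.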
Abstract
We present a new Hebrew medical language model designed to extract structured clinical timelines from electronic health records, enabling the construction of patient journeys. Our model is based on DictaBERT 2.0 and continually pre-trained on over five million de-identified hospital records. To evaluate its effectiveness, we introduce two new datasets -- one from internal medicine and emergency departments, and another from oncology -- annotated with temporal relations between clinical events. Experiments show that the model achieves strong performance on both datasets. We also find that vocabulary adaptation improves token efficiency and that de-identification does not compromise downstream performance, supporting privacy-conscious model development. The model is made available for research use under ethical restrictions.
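To make the token-efficiency claim concrete: a common measure is "fertility", the average number of subword tokens per whitespace word, where a lower value on clinical text indicates a better-adapted vocabulary. Below is a minimal sketch of that measurement, assuming HuggingFace tokenizers; both checkpoint paths and the sample sentence are placeholders, not the paper's evaluation setup.

```python
# Sketch: compare tokenizer fertility before and after vocabulary adaptation.
from transformers import AutoTokenizer

def fertility(tokenizer, texts):
    """Average number of subword tokens per whitespace-separated word."""
    n_words = sum(len(t.split()) for t in texts)
    n_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    return n_tokens / n_words

# Placeholder paths: a base checkpoint and a vocabulary-adapted variant.
base = AutoTokenizer.from_pretrained("dicta-il/dictabert")
adapted = AutoTokenizer.from_pretrained("path/to/clinical-adapted-tokenizer")

sample = ["החולה אושפז במחלקה הפנימית"]  # "The patient was admitted to internal medicine"
print(f"base fertility:    {fertility(base, sample):.2f}")
print(f"adapted fertility: {fertility(adapted, sample):.2f}")
```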