🤖 AI Summary
This study addresses the scarcity of annotated Finnish medical texts by proposing a method to predict the benefit of unsupervised pretraining for downstream tasks through analyzing geometric changes in the embedding space during domain-adaptive fine-tuning. Specifically, we adapt FinBERT to Finnish pathology reports and investigate the relationship between the dynamics of embedding geometry and supervised classification performance. Our experiments demonstrate that the evolution of the embedding space in early fine-tuning stages effectively predicts the final model performance. This finding offers a practical approach for evaluating the efficacy of fine-tuning in medical AI settings—where acquiring labeled data is costly—without requiring annotated examples upfront.
📝 Abstract
In NLP classification tasks with little labeled data, domain fine-tuning of transformer models on unlabeled data is an established approach. This paper has two aims:
(1) We describe our observations from fine-tuning the Finnish BERT model on Finnish medical text data.
(2) We report on our attempts to predict the benefit of domain-specific pre-training of Finnish BERT by observing how domain fine-tuning changes the geometry of the embeddings.
Our driving motivation is the common situation in healthcare AI where acquiring datasets, and especially labels, can involve long delays.
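To make aim (2) more concrete, the following is a minimal sketch of one way a change in embedding geometry might be quantified without any labels. It is not the method used in this paper: the function names `cosine` and `embedding_shift`, the choice of mean cosine similarity and mean Euclidean displacement as summary statistics, and the toy vectors are all assumptions made for illustration. In practice the two inputs would be embeddings of the same texts taken from the base model and from the domain fine-tuned model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def embedding_shift(base, tuned):
    """Summarise how far each text's embedding moved after fine-tuning.

    base, tuned: lists of vectors, where base[i] and tuned[i] are the
    embeddings of the same text before and after domain fine-tuning.
    The two summary statistics are illustrative choices, not the paper's.
    """
    if len(base) != len(tuned):
        raise ValueError("base and tuned must contain the same number of vectors")
    n = len(base)
    cosines = [cosine(b, t) for b, t in zip(base, tuned)]
    distances = [math.dist(b, t) for b, t in zip(base, tuned)]
    return {
        "mean_cosine": sum(cosines) / n,      # 1.0 means no change in direction
        "mean_distance": sum(distances) / n,  # 0.0 means no displacement
    }


# Toy example: the first embedding is unchanged, the second rotates by 90 degrees.
base = [[1.0, 0.0], [0.0, 1.0]]
tuned = [[1.0, 0.0], [1.0, 0.0]]
shift = embedding_shift(base, tuned)
print(shift)
```

Tracking statistics of this kind across fine-tuning checkpoints requires only unlabeled text, which matches the setting described above where labels arrive late.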