Domain Fine-Tuning FinBERT on Finnish Histopathological Reports: Train-Time Signals and Downstream Correlations

📅 2026-04-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the scarcity of annotated Finnish medical texts by proposing a method to predict the benefit of unsupervised pretraining for downstream tasks by analyzing geometric changes in the embedding space during domain-adaptive fine-tuning. Specifically, we adapt FinBERT to Finnish pathology reports and investigate the relationship between the dynamics of embedding geometry and supervised classification performance. Our experiments demonstrate that the evolution of the embedding space in early fine-tuning stages effectively predicts the final model performance. This finding offers a practical approach for evaluating the efficacy of fine-tuning in medical AI settings, where acquiring labeled data is costly, without requiring annotated examples upfront.

📝 Abstract

In NLP classification tasks where little labeled data exists, domain fine-tuning of transformer models on unlabeled data is an established approach. In this paper we have two aims. (1) We describe our observations from fine-tuning the Finnish BERT model on Finnish medical text data. (2) We report on our attempts to predict the benefit of domain-specific pre-training of Finnish BERT from observing the geometry of embedding changes due to domain fine-tuning. Our driving motivation is the common situation in healthcare AI where we might experience long delays in acquiring datasets, especially with respect to labels.
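
The abstract does not say which geometric quantities are tracked, so the following is only a minimal sketch of what "observing the geometry of embedding changes" could look like in practice. It assumes sentence embeddings for the same unlabeled texts have already been extracted from the base model and from a domain fine-tuned checkpoint as two `(n, d)` arrays. The three statistics (mean self-cosine, mean L2 shift, change in mean pairwise cosine) and the function names are illustrative choices, not the paper's method, and the arrays below are synthetic stand-ins rather than real FinBERT outputs.

```python
import numpy as np


def cosine_rows(a, b):
    """Row-wise cosine similarity between two (n, d) arrays."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return num / np.clip(den, 1e-12, None)


def mean_pairwise_cosine(x):
    """Average cosine similarity over all distinct row pairs (a rough anisotropy proxy)."""
    unit = x / np.clip(np.linalg.norm(x, axis=1, keepdims=True), 1e-12, None)
    sims = unit @ unit.T
    n = x.shape[0]
    # Exclude the diagonal (each row compared with itself).
    return (sims.sum() - np.trace(sims)) / (n * (n - 1))


def geometry_shift(base, adapted):
    """Summarize how embeddings of the same texts moved between two checkpoints.

    Hypothetical helper: the chosen statistics are assumptions for illustration.
    """
    return {
        "mean_self_cosine": float(np.mean(cosine_rows(base, adapted))),
        "mean_l2_shift": float(np.mean(np.linalg.norm(adapted - base, axis=1))),
        "anisotropy_delta": float(
            mean_pairwise_cosine(adapted) - mean_pairwise_cosine(base)
        ),
    }


# Synthetic stand-in for embeddings of 64 texts with 768 dimensions.
rng = np.random.default_rng(0)
base = rng.normal(size=(64, 768))
adapted = base + 0.1 * rng.normal(size=(64, 768))

stats = geometry_shift(base, adapted)
identical = geometry_shift(base, base)
```

Computed at several early checkpoints, statistics of this kind need no labels, which is what would make them candidates for correlating with later supervised performance.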

Problem

Research questions and friction points this paper is trying to address.

domain fine-tuning

low-resource NLP

medical text classification

Finnish BERT

label scarcity

Innovation

Methods, ideas, or system contributions that make the work stand out.

domain fine-tuning

embedding geometry

downstream correlation