Adapting Natural Language Processing Models Across Jurisdictions: A Pilot Study in Canadian Cancer Registries

📅 2026-01-02

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the limited generalizability of natural language processing (NLP) models for cancer registries, which stems from heterogeneous pathology report formats across Canadian provinces. It presents the first cross-provincial adaptability evaluation and introduces a privacy-preserving workflow in which only model weights are shared between provinces. By fine-tuning BCCRTron and GatorTron on data from the Newfoundland & Labrador Cancer Registry, using complementary synoptic and diagnosis-focused report-section inputs, the authors construct a conservative OR-ensemble of the two models. The ensemble achieves a recall of 0.99 on both the Tier 1 and Tier 2 tasks, reducing missed cancers from 48 and 54 (standalone models) to 24 in Tier 1 and missed reportable cancers from 54 and 46 to 33 in Tier 2, and it lays the groundwork for a pan-Canadian foundation model for cancer pathology.

📝 Abstract

Population-based cancer registries depend on pathology reports as their primary diagnostic source, yet manual abstraction is resource-intensive and contributes to delays in cancer data. While transformer-based NLP systems have improved registry workflows, their ability to generalize across jurisdictions with differing reporting conventions remains poorly understood. We present the first cross-provincial evaluation of adapting BCCRTron, a domain-adapted transformer model developed at the British Columbia Cancer Registry, alongside GatorTron, a biomedical transformer model, for cancer surveillance in Canada. Our training dataset consisted of approximately 104,000 and 22,000 de-identified pathology reports from the Newfoundland & Labrador Cancer Registry (NLCR) for Tier 1 (cancer vs. non-cancer) and Tier 2 (reportable vs. non-reportable) tasks, respectively. Both models were fine-tuned using complementary synoptic and diagnosis-focused report section input pipelines. Across NLCR test sets, the adapted models maintained high performance, demonstrating that transformers pretrained in one jurisdiction can be localized to another with modest fine-tuning. To improve sensitivity, we combined the two models using a conservative OR-ensemble, which achieved a Tier 1 recall of 0.99 and reduced missed cancers to 24, compared with 48 and 54 for the standalone models. For Tier 2, the ensemble achieved 0.99 recall and reduced missed reportable cancers to 33, compared with 54 and 46 for the individual models. These findings demonstrate that an ensemble combining complementary text representations substantially reduces missed cancers and improves error coverage in cancer-registry NLP. We implement a privacy-preserving workflow in which only model weights are shared between provinces, supporting interoperable NLP infrastructure and a future pan-Canadian foundation model for cancer pathology and registry workflows.
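The OR-ensemble described in the abstract can be illustrated with a minimal sketch. It assumes "OR-ensemble" means a report is flagged positive whenever either model flags it, which is the usual reading for a sensitivity-oriented combination; the abstract does not spell out the rule. The function names (`or_ensemble`, `recall`, `missed`) and the toy labels are hypothetical and are not taken from the paper or its data.

```python
from typing import List, Sequence


def or_ensemble(preds_a: Sequence[int], preds_b: Sequence[int]) -> List[int]:
    """Combine two binary predictions: positive if EITHER model says positive."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    return [int(a == 1 or b == 1) for a, b in zip(preds_a, preds_b)]


def missed(y_true: Sequence[int], y_pred: Sequence[int]) -> int:
    """Count false negatives: true positives that the model labelled negative."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)


def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Recall = TP / (TP + FN); returns 0.0 when there are no positive cases."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = missed(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy data (hypothetical, for illustration only): 1 = cancer, 0 = non-cancer.
y_true = [1, 1, 1, 1, 0, 0]
model_a = [1, 0, 1, 0, 0, 0]  # e.g. a model fed one report section
model_b = [0, 1, 1, 0, 0, 1]  # e.g. a model fed a different report section

ensemble = or_ensemble(model_a, model_b)

for name, preds in [("model_a", model_a), ("model_b", model_b), ("ensemble", ensemble)]:
    print(f"{name}: recall={recall(y_true, preds):.2f}, missed={missed(y_true, preds)}")
```

Under this rule the ensemble misses a case only when both models miss it, so its missed-case count can never exceed that of either model alone. The trade-off is that false positives from either model carry through, which is why the combination favours sensitivity over precision.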

Problem

Research questions and friction points this paper is trying to address.

cross-jurisdiction adaptation

cancer registry

natural language processing

pathology reports

missed cancers

Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-jurisdiction adaptation

ensemble NLP

privacy-preserving model sharing