Adapting Natural Language Processing Models Across Jurisdictions: A Pilot Study in Canadian Cancer Registries

📅 2026-01-02
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the limited generalizability of natural language processing (NLP) models for cancer registries, which stems from heterogeneous pathology report formats across Canadian provinces. It presents the first cross-provincial adaptability evaluation and describes a privacy-preserving workflow in which only model weights are shared between provinces. The authors fine-tune BCCRTron and GatorTron on data from Newfoundland and Labrador using two complementary input pipelines (synoptic and diagnosis-focused report sections) and combine the two models in a conservative OR-ensemble. The ensemble achieves a recall of 0.99 on both the Tier 1 and Tier 2 tasks, reducing missed cancers from 48 and 54 for the standalone models to 24 in Tier 1, and from 54 and 46 to 33 in Tier 2. This substantially improves sensitivity and lays the groundwork for a pan-Canadian foundation model for cancer pathology.

📝 Abstract
Population-based cancer registries depend on pathology reports as their primary diagnostic source, yet manual abstraction is resource-intensive and contributes to delays in cancer data. While transformer-based NLP systems have improved registry workflows, their ability to generalize across jurisdictions with differing reporting conventions remains poorly understood. We present the first cross-provincial evaluation of adapting BCCRTron, a domain-adapted transformer model developed at the British Columbia Cancer Registry, alongside GatorTron, a biomedical transformer model, for cancer surveillance in Canada. Our training dataset consisted of approximately 104,000 and 22,000 de-identified pathology reports from the Newfoundland & Labrador Cancer Registry (NLCR) for the Tier 1 (cancer vs. non-cancer) and Tier 2 (reportable vs. non-reportable) tasks, respectively. Both models were fine-tuned using complementary synoptic and diagnosis-focused report-section input pipelines. Across NLCR test sets, the adapted models maintained high performance, demonstrating that transformers pretrained in one jurisdiction can be localized to another with modest fine-tuning. To improve sensitivity, we combined the two models using a conservative OR-ensemble, which achieved a Tier 1 recall of 0.99 and reduced missed cancers to 24, compared with 48 and 54 for the standalone models. For Tier 2, the ensemble achieved 0.99 recall and reduced missed reportable cancers to 33, compared with 54 and 46 for the individual models. These findings demonstrate that an ensemble combining complementary text representations substantially reduces missed cancers and improves error coverage in cancer-registry NLP. We implement a privacy-preserving workflow in which only model weights are shared between provinces, supporting interoperable NLP infrastructure and a future pan-Canadian foundation model for cancer pathology and registry workflows.
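The ensemble step described in the abstract can be illustrated with a minimal sketch. It assumes that "OR" denotes a logical OR over the two models' positive predictions, which is consistent with the stated goal of improving sensitivity but is not spelled out in this listing. The function names (`or_ensemble`, `recall`, `missed`) and the toy labels are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def or_ensemble(preds_a: Sequence[int], preds_b: Sequence[int]) -> list[int]:
    """Flag a report as positive if either model flags it (logical OR)."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    return [int(a == 1 or b == 1) for a, b in zip(preds_a, preds_b)]


def missed(y_true: Sequence[int], y_pred: Sequence[int]) -> int:
    """Count false negatives: true positives that were predicted negative."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)


def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Recall = TP / (TP + FN); returns 0.0 when there are no positives."""
    positives = sum(1 for t in y_true if t == 1)
    if positives == 0:
        return 0.0
    return (positives - missed(y_true, y_pred)) / positives


# Toy data (hypothetical): 1 = cancer, 0 = non-cancer.
y_true = [1, 1, 1, 1, 0, 0]
model_a = [1, 0, 1, 0, 0, 0]  # misses reports at index 1 and 3
model_b = [0, 1, 1, 0, 0, 1]  # misses index 0 and 3, one false positive

ensemble = or_ensemble(model_a, model_b)

for name, preds in [("A", model_a), ("B", model_b), ("OR", ensemble)]:
    print(name, "missed:", missed(y_true, preds), "recall:", recall(y_true, preds))
```

A case is missed by the OR combination only when both models miss it, so the ensemble's missed count can never exceed that of either model alone. The trade-off is that false positives from either model are also passed through, which is why the rule favors sensitivity over precision.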
Problem

Research questions and friction points this paper is trying to address.

cross-jurisdiction adaptation
cancer registry
natural language processing
pathology reports
missed cancers
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-jurisdiction adaptation
ensemble NLP
privacy-preserving model sharing
domain-adapted transformers
cancer registry automation
Authors

Jonathan Simkin
Director, BC Cancer Registry
Epidemiology, Machine Learning, Natural Language Processing

Lovedeep Gondara
School of Population and Public Health, University of British Columbia, Vancouver, BC

Zeeshan Rizvi
Newfoundland & Labrador Health Services, St. John’s, NL

Gregory Doyle
Newfoundland & Labrador Health Services, St. John’s, NL

Jeff Dowden
Newfoundland & Labrador Health Services, St. John’s, NL

Dan Bond
Newfoundland & Labrador Health Services, St. John’s, NL

Desmond Martin
Newfoundland & Labrador Health Services, St. John’s, NL

Raymond Ng
University of British Columbia
data mining, health informatics, genomics, NLP, text mining