FRACCO: A gold-standard annotated corpus of oncological entities with ICD-O-3.1 normalisation

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
French oncology NLP is hindered by a scarcity of high-quality annotated corpora. To address this, we introduce FRACCO, an expert-annotated French clinical oncology corpus of 1,301 synthetic cases translated from the Spanish CANTEMIST corpus. Entities covering morphology (histology), topography (anatomical site), and histologic differentiation are annotated and normalized against ICD-O-3.1, and an additional layer of composite expressions combines multiple ICD-O elements into unified clinical concepts. Two domain experts manually annotated the entity spans, and a team of five annotators produced the ICD-O normalizations through a combination of automated matching and manual validation. The resulting dataset covers 399 unique morphology codes, 272 topography codes, and 2,043 unique composite expressions, for a total of 71,127 ICD-O normalizations. It provides a reference standard for named entity recognition and concept normalization in French oncology text.

📝 Abstract
Developing natural language processing tools for clinical text requires annotated datasets, yet French oncology resources remain scarce. We present FRACCO (FRench Annotated Corpus for Clinical Oncology), an expert-annotated corpus of 1301 synthetic French clinical cases, initially translated from the Spanish CANTEMIST corpus as part of the FRASIMED initiative. Each document is annotated with terms related to morphology, topography, and histologic differentiation, using the International Classification of Diseases for Oncology (ICD-O) as reference. An additional annotation layer captures composite expression-level normalisations that combine multiple ICD-O elements into unified clinical concepts. Annotation quality was ensured through expert review: all 1301 texts were manually annotated for entity spans by two domain experts. A total of 71127 ICD-O normalisations were produced through a combination of automated matching and manual validation by a team of five annotators. The final dataset represents 399 unique morphology codes (from 2549 different expressions), 272 topography codes (from 3143 different expressions), and 2043 unique composite expressions (from 11144 different expressions). This dataset provides a reference standard for named entity recognition and concept normalisation in French oncology texts.
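To make the annotation layers concrete, the following is a minimal sketch of how an entity span and a composite expression might be represented in memory. The class names, field names, the `make_span` helper, and the ` | ` separator are all hypothetical and are not taken from the FRACCO release; the codes used (8140/3 for adenocarcinoma NOS, C18.7 for sigmoid colon, grade 1 for well differentiated) are standard ICD-O values chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EntitySpan:
    """One annotated mention (hypothetical layout, not the corpus file format)."""
    start: int       # character offset where the mention begins
    end: int         # character offset just past the mention
    text: str        # surface form as it appears in the document
    layer: str       # "morphology", "topography" or "differentiation"
    icdo_code: str   # ICD-O code assigned to the mention


def make_span(document: str, surface: str, layer: str, icdo_code: str) -> EntitySpan:
    """Locate the first occurrence of `surface` and build the span from it."""
    start = document.index(surface)
    return EntitySpan(start, start + len(surface), surface, layer, icdo_code)


@dataclass(frozen=True)
class CompositeExpression:
    """Combines several ICD-O elements into one clinical concept."""
    morphology: EntitySpan
    topography: Optional[EntitySpan] = None
    differentiation: Optional[EntitySpan] = None

    def normalised(self) -> str:
        # Assumed rendering: morphology, then grade, then topography.
        parts = [self.morphology.icdo_code]
        if self.differentiation is not None:
            parts.append("grade " + self.differentiation.icdo_code)
        if self.topography is not None:
            parts.append(self.topography.icdo_code)
        return " | ".join(parts)


sentence = "adénocarcinome bien différencié du sigmoïde"
composite = CompositeExpression(
    morphology=make_span(sentence, "adénocarcinome", "morphology", "8140/3"),
    topography=make_span(sentence, "sigmoïde", "topography", "C18.7"),
    differentiation=make_span(sentence, "bien différencié", "differentiation", "1"),
)
print(composite.normalised())  # 8140/3 | grade 1 | C18.7
```

Keeping character offsets on each span lets the same record serve both tasks named in the abstract: the offsets support named entity recognition, and the code supports concept normalisation.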
Problem

Research questions and friction points this paper is trying to address.

Creating a French annotated corpus for oncology NLP tasks
Normalizing oncological entities using ICD-O-3.1 terminology standards
Providing a reference dataset for entity recognition and concept normalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created an expert-annotated French oncology corpus
Used the ICD-O classification for entity normalization
Combined automated matching with manual validation
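As an illustration of how automated matching can be combined with manual validation, here is a minimal sketch based on exact dictionary lookup after accent and case folding. It is not the authors' pipeline: the `LEXICON` contents, the function names, and the rule that unmatched expressions go to a review queue are all assumptions made for the example.

```python
import unicodedata


def fold(term: str) -> str:
    """Lowercase and strip accents so 'Adénocarcinome' matches 'adenocarcinome'."""
    decomposed = unicodedata.normalize("NFD", term.lower().strip())
    # Drop combining marks (category Mn) left over after decomposition.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Hypothetical two-entry lexicon mapping folded French terms to ICD-O morphology codes.
LEXICON = {
    "adenocarcinome": "8140/3",
    "carcinome epidermoide": "8070/3",
}


def auto_match(expressions, lexicon):
    """Split expressions into automatically coded ones and a manual-review queue."""
    matched = {}
    needs_review = []
    for expression in expressions:
        code = lexicon.get(fold(expression))
        if code is None:
            needs_review.append(expression)   # left for a human annotator
        else:
            matched[expression] = code        # still subject to manual validation
    return matched, needs_review


matched, needs_review = auto_match(
    ["Adénocarcinome", "Carcinome épidermoïde", "tumeur mal définie"],
    LEXICON,
)
print(matched)
print(needs_review)
```

Exact lookup keeps the automated step predictable, at the cost of sending every spelling variant to the review queue; a fuzzier matcher would shrink the queue but make manual validation of the automatic matches more important.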
Johann Pignat
Service des sciences de l’information médicale, Hôpitaux Universitaires de Genève, Suisse
Milena Vucetic
Service des sciences de l’information médicale, Hôpitaux Universitaires de Genève, Suisse
Christophe Gaudet-Blavignac
Service des sciences de l’information médicale, Hôpitaux Universitaires de Genève, Suisse
Jamil Zaghir
Service des sciences de l’information médicale, Hôpitaux Universitaires de Genève, Suisse
Amandine Stettler
Service des sciences de l’information médicale, Hôpitaux Universitaires de Genève, Suisse
Fanny Amrein
Service d’oncologie de précision, Hôpitaux Universitaires de Genève, Suisse
Jonatan Bonjour
Service d’oncologie de précision, Hôpitaux Universitaires de Genève, Suisse
Jean-Philippe Goldman
University of Geneva
speech, prosody, NLP, crowdsourcing, digital humanities
Olivier Michielin
Service d’oncologie de précision, Hôpitaux Universitaires de Genève, Suisse
Christian Lovis
Professor, University of Geneva and University Hospitals of Geneva
clinical information systems, semantics, data interoperability and semantics, health bigdata and
Mina Bjelogrlic
University of Geneva
Artificial Intelligence, Big Data