Converting Annotated Clinical Cases into Structured Case Report Forms

📅 2025-06-13

🤖 AI Summary

High-quality, bilingual annotated Case Report Form (CRF) datasets are critically scarce in clinical research, hindering the development of CRF slot-filling models. To address this, we propose the first cross-task data transfer paradigm for CRF generation, using a semi-automatic approach guided by rules and a large language model (LLM) to transform the existing clinical information extraction corpus (E3C) into a high-quality English–Italian bilingual CRF slot-filling dataset. This constitutes the first publicly available, standardized, bilingual CRF benchmark. Under zero-shot evaluation, a closed-source LLM achieves 67.3% F1 on English and 59.7% F1 on Italian, confirming both the difficulty of the task and the usefulness of the dataset. Our work bridges critical gaps in bilingual medical structured understanding, in both data resources and methodology, and establishes a novel paradigm for controllable mapping from clinical text to structured forms.

📝 Abstract

Case Report Forms (CRFs) are widely used in medical research because they ensure the accuracy, reliability, and validity of results in clinical studies. However, publicly available, well-annotated CRF datasets are scarce, which limits the development of CRF slot-filling systems that can fill in a CRF from clinical notes. To mitigate this scarcity, we propose taking advantage of available datasets annotated for information extraction tasks and converting them into structured CRFs. We present a semi-automatic conversion methodology, which we applied to the E3C dataset in two languages (English and Italian), resulting in a new, high-quality dataset for CRF slot filling. In several experiments on the created dataset, slot filling reaches 59.7% for Italian and 67.3% for English with a closed large language model in a zero-shot setting, and performance is worse for three families of open-source models, showing that filling CRFs is challenging even for recent state-of-the-art LLMs. We release the dataset at https://huggingface.co/collections/NLP-FBK/e3c-to-crf-67b9844065460cbe42f80166
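
To make the conversion idea more concrete, the following is a minimal sketch of how annotated spans from an information extraction corpus might be mapped onto CRF slots. It is not the authors' methodology: the `Annotation` schema, the `CRF_ITEMS` table, the `"unknown"` default, and the exact-match rule are all assumptions chosen for illustration. Spans that no rule can place are collected for manual review, which is one plausible reading of "semi-automatic".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One annotated span from an IE corpus (hypothetical schema)."""
    label: str  # annotation type, e.g. "ENTITY" (illustrative only)
    text: str   # surface form as it appears in the clinical note


# Hypothetical CRF definition: item name -> lowercase trigger terms.
CRF_ITEMS = {
    "fever": {"fever", "pyrexia"},
    "cough": {"cough"},
    "chest pain": {"chest pain"},
}

UNKNOWN = "unknown"  # assumed default for slots with no evidence


def annotations_to_crf(annotations, crf_items=CRF_ITEMS):
    """Fill CRF slots from annotations using exact trigger matching.

    Returns the filled CRF and the annotations that matched no item,
    which a human annotator would review in a semi-automatic workflow.
    """
    crf = {item: UNKNOWN for item in crf_items}
    needs_review = []
    for ann in annotations:
        surface = ann.text.strip().lower()
        matched = False
        for item, triggers in crf_items.items():
            if surface in triggers:
                crf[item] = ann.text
                matched = True
        if not matched:
            needs_review.append(ann)
    return crf, needs_review
```

In this sketch, an annotation such as "Pyrexia" would fill the "fever" item, while an unlisted term such as "dyspnea" would be queued for review rather than silently dropped.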

Problem

Research questions and friction points this paper is trying to address.

Convert annotated clinical cases into structured CRFs

Address scarcity of publicly available CRF datasets

Evaluate CRF slot filling performance across languages and models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-automatic conversion of annotated datasets

Creation of multilingual CRF datasets

Evaluation of LLMs on CRF slot filling
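
As an illustration of how slot-filling output could be scored, here is a small sketch of an F1 computation over filled slots. The scoring rule is an assumption (exact match on case-normalized values, with slots left at `"unknown"` ignored), and the function name `slot_f1` is hypothetical; the paper's actual metric definition may differ.

```python
def slot_f1(gold, pred, empty="unknown"):
    """F1 over (item, value) pairs, ignoring slots left empty.

    gold and pred map CRF item names to string values. Matching is
    exact after stripping whitespace and lowercasing (an assumption).
    """
    gold_filled = {(k, v.strip().lower()) for k, v in gold.items() if v != empty}
    pred_filled = {(k, v.strip().lower()) for k, v in pred.items() if v != empty}

    true_positives = len(gold_filled & pred_filled)
    precision = true_positives / len(pred_filled) if pred_filled else 0.0
    recall = true_positives / len(gold_filled) if gold_filled else 0.0

    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With this definition, a prediction that fills one of two gold slots correctly and adds one spurious slot gets precision 0.5, recall 0.5, and F1 0.5.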
