IMB: An Italian Medical Benchmark for Question Answering

📅 2025-10-21
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the challenges that informal, high-noise text in non-English online medical forums poses for question-answering (QA) systems, this work introduces IMB-QA, a large-scale Italian medical QA benchmark of roughly 780K patient-doctor dialogues, and IMB-MCQA, a multiple-choice QA dataset of roughly 25K items. The authors also propose an LLM-based text clarification preprocessing method tailored to noisy, community-sourced medical forum data. Methodologically, the work combines domain-specific fine-tuning, retrieval-augmented generation (RAG), and lightweight model adaptation. Experiments show that domain adaptation combined with retrieval augmentation outperforms scaling model size alone: compact, domain-adapted models surpass general-purpose large language models on medical QA tasks. All datasets, code, and evaluation frameworks are publicly released, establishing foundational infrastructure for multilingual medical AI research.
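The retrieval-augmented setup described above can be sketched in miniature: retrieve the forum passages most similar to a patient question, then assemble them into a prompt for a generator model. This is a minimal illustrative sketch, not the paper's actual retriever or index; the corpus, tokenizer, and cosine scoring are toy stand-ins.

```python
# Minimal sketch of retrieval-augmented QA over a forum-style corpus.
# Bag-of-words cosine retrieval is an illustrative placeholder for
# whatever retriever the benchmark experiments actually use.
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer; keeps purely alphabetic tokens.
    return [t for t in text.lower().split() if t.isalpha()]

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    common = set(a) & set(b)
    num = sum(a[t] * b[t] for t in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def retrieve(query, corpus, k=2):
    # Rank corpus passages by similarity to the query, keep the top k.
    q = Counter(tokenize(query))
    ranked = sorted(corpus, key=lambda d: cosine(q, Counter(tokenize(d))),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, passages):
    # Format retrieved passages as context for a downstream LLM.
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "paracetamol is commonly used to treat fever and mild pain",
    "antibiotics are not effective against viral infections",
    "regular exercise improves cardiovascular health",
]
query = "does paracetamol help with fever"
prompt = build_prompt(query, retrieve(query, corpus))
print(prompt)
```

In a real pipeline the prompt would be sent to the fine-tuned generator; here it is only printed to show the retrieve-then-prompt structure.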

๐Ÿ“ Abstract
Online medical forums have long served as vital platforms where patients seek professional healthcare advice, generating vast amounts of valuable knowledge. However, the informal nature and linguistic complexity of forum interactions pose significant challenges for automated question answering systems, especially when dealing with non-English languages. We present two comprehensive Italian medical benchmarks: **IMB-QA**, containing 782,644 patient-doctor conversations from 77 medical categories, and **IMB-MCQA**, comprising 25,862 multiple-choice questions from medical specialty examinations. We demonstrate how Large Language Models (LLMs) can be leveraged to improve the clarity and consistency of medical forum data while retaining their original meaning and conversational style, and compare a variety of LLM architectures on both open and multiple-choice question answering tasks. Our experiments with Retrieval Augmented Generation (RAG) and domain-specific fine-tuning reveal that specialized adaptation strategies can outperform larger, general-purpose models in medical question answering tasks. These findings suggest that effective medical AI systems may benefit more from domain expertise and efficient information retrieval than from increased model scale. We release both datasets and evaluation frameworks in our GitHub repository to support further research on multilingual medical question answering: https://github.com/PRAISELab-PicusLab/IMB.
Problem

Research questions and friction points this paper is trying to address.

Addressing automated question answering challenges in Italian medical forums
Improving clarity and consistency of medical forum data using LLMs
Evaluating specialized adaptation strategies for medical question answering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created Italian medical benchmarks for QA tasks
Used LLMs to enhance medical forum data clarity
Applied RAG and fine-tuning for domain adaptation
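Benchmarks like IMB-MCQA are typically scored by mapping each model reply to a choice letter and computing accuracy. The helper below is a generic sketch of that scoring step, under the assumption of A-D options; the regex-based `extract_choice` and the toy replies are illustrative, not the paper's evaluation code.

```python
# Sketch of multiple-choice scoring for an MCQA-style benchmark.
# extract_choice() and the sample replies are toy placeholders.
import re

def extract_choice(answer_text, valid="ABCD"):
    # Pull the first standalone choice letter from a free-text reply.
    m = re.search(rf"\b([{valid}])\b", answer_text.upper())
    return m.group(1) if m else None

def accuracy(preds, gold):
    # Fraction of predictions matching the gold answer keys.
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

replies = ["The correct answer is B.", "A", "I would pick (D)."]
preds = [extract_choice(r) for r in replies]
print(preds, accuracy(preds, ["B", "A", "C"]))  # → ['B', 'A', 'D'] 0.666...
```

Parsing letters out of free text rather than requiring exact-match output is a common design choice when evaluating generative LLMs on multiple-choice questions.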