CaresAI at BioCreative IX Track 1 -- LLM for Biomedical QA

📅 2025-08-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates performance bottlenecks and improvement strategies for large language models (LLMs) in multi-hop biomedical question answering—tasks requiring complex reasoning across diseases, genes, and chemicals. To address overly verbose generation and poor short-answer extraction, we propose a two-stage reasoning pipeline incorporating answer structural constraints and output format control. We conduct supervised fine-tuning on LLaMA-3 8B using heterogeneous biomedical QA datasets (BioASQ, MedQuAD, TREC) and systematically analyze the impact of answer length on training efficacy. Experimental results show a concept-level accuracy of up to 0.8, indicating substantial gains in domain-specific semantic understanding; however, Exact Match remains suboptimal, underscoring the critical importance of output controllability and post-hoc optimization. Our work demonstrates that structured output guidance and targeted data curation significantly enhance LLMs' fidelity and precision in high-stakes biomedical QA.
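The second stage of the pipeline described above can be illustrated with a small post-processing sketch. The paper's actual extraction prompt and logic are not published, so the marker pattern (`Answer:`), the first-sentence heuristic, and the word budget below are all illustrative assumptions, not the authors' implementation:

```python
import re

def extract_short_answer(verbose_output: str, max_words: int = 5) -> str:
    """Stage-2 post-processing: pull a concise answer span out of a
    verbose stage-1 generation. Heuristics only; illustrative sketch."""
    # Prefer an explicit "Answer:" marker if the model emitted one.
    m = re.search(r"(?:final answer|answer)\s*[:\-]\s*(.+)",
                  verbose_output, re.IGNORECASE)
    candidate = m.group(1) if m else verbose_output
    # Keep only the first sentence, then trim to a fixed word budget.
    candidate = re.split(r"(?<=[.!?])\s", candidate.strip())[0]
    words = candidate.rstrip(".!?").split()
    return " ".join(words[:max_words])

verbose = ("The gene most commonly associated with cystic fibrosis "
           "is CFTR. Answer: CFTR")
print(extract_short_answer(verbose))  # → CFTR
```

A deterministic extractor like this trades recall for controllability: it cannot recover an answer the model never stated, but it guarantees the short, strictly formatted output that exact-match evaluation rewards.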

📝 Abstract
Large language models (LLMs) are increasingly capable of accurate question answering across various domains. However, rigorous evaluation of their performance on complex question-answering (QA) capabilities is essential before deployment in real-world biomedical and healthcare applications. This paper presents our approach to the MedHopQA track of the BioCreative IX shared task, which focuses on multi-hop biomedical question answering involving diseases, genes, and chemicals. We adopt a supervised fine-tuning strategy leveraging LLaMA 3 8B, enhanced with a curated biomedical question-answer dataset compiled from external sources including BioASQ, MedQuAD, and TREC. Three experimental setups are explored: fine-tuning on combined short and long answers, short answers only, and long answers only. While our models demonstrate strong domain understanding, achieving concept-level accuracy scores of up to 0.8, their Exact Match (EM) scores remain significantly lower, particularly in the test phase. We introduce a two-stage inference pipeline for precise short-answer extraction to mitigate verbosity and improve alignment with evaluation metrics. Despite partial improvements, challenges persist in generating strictly formatted outputs. Our findings highlight the gap between semantic understanding and exact answer evaluation in biomedical LLM applications, motivating further research in output control and post-processing strategies.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM performance on complex biomedical question answering
Improving exact match scores for multi-hop biomedical QA
Addressing output formatting challenges in biomedical LLM applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Supervised fine-tuning with LLaMA 3 8B
Biomedical QA dataset curated from BioASQ, MedQuAD, and TREC
Two-stage inference pipeline for answer extraction
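The gap the paper reports between concept-level accuracy and Exact Match can be made concrete with a small scoring sketch. The MedHopQA track's official scorer is not reproduced here; the SQuAD-style normalization and the substring-based concept check below are assumptions chosen to illustrate why a semantically correct but verbose answer passes one metric and fails the other:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace
    (SQuAD-style normalization; illustrative, not the track's scorer)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> int:
    """Strict EM: normalized prediction must equal the normalized gold."""
    return int(normalize(pred) == normalize(gold))

def concept_match(pred: str, gold: str) -> int:
    """Loose concept-level check: gold concept appears anywhere in the
    prediction. Hypothetical stand-in for the concept-level metric."""
    return int(normalize(gold) in normalize(pred))

pred, gold = "mutations in the CFTR gene", "CFTR gene"
print(exact_match(pred, gold), concept_match(pred, gold))  # → 0 1
```

The example shows the failure mode the two-stage pipeline targets: a verbose prediction containing the right concept scores 1 on the loose metric but 0 on EM until it is trimmed to the gold span.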
Reem Abdel-Salam
MSc student at Faculty of Engineering, Computer Department, Cairo University
Deep Learning, Computer Vision, Image Processing
Mary Adewunmi
Menzies School of Health Research, Charles Darwin University, NT, Australia; CaresAI, Australia
Modinat A. Abayomi
Department of Biology, Boston College, Massachusetts, USA; CaresAI, Australia