VILLA: Versatile Information Retrieval From Scientific Literature Using Large LAnguage Models

📅 2026-03-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of high-quality ground truth datasets for training machine learning models, which limits the use of artificial intelligence for information extraction in complex, open-ended domains such as virology. The authors propose VILLA, a multi-step retrieval-augmented generation (RAG) framework for scientific information extraction, and apply it to extracting influenza A virus mutations that modify the virus's interaction with the host. They also curate a domain-specific ground truth dataset of 629 mutations in ten influenza A virus proteins, drawn from 239 scientific publications. In the authors' evaluation, VILLA outperforms vanilla RAG and other state-of-the-art RAG- and agent-based tools on this task.

📝 Abstract

The lack of high-quality ground truth datasets to train machine learning (ML) models impedes the potential of artificial intelligence (AI) for scientific research. Scientific information extraction (SIE) from the literature using LLMs is emerging as a powerful approach to automate the creation of these datasets. However, existing LLM-based approaches and benchmarking studies for SIE focus on broad topics such as biomedicine and chemistry, are limited to choice-based tasks, and focus on extracting information from short and well-formatted text. The potential of SIE methods in complex, open-ended tasks is considerably under-explored. In this study, we use virology, a domain that has been virtually ignored in SIE, to address these research gaps. We design a unique, open-ended SIE task of extracting mutations in a given virus that modify its interaction with the host. We develop a new, multi-step retrieval-augmented generation (RAG) framework called VILLA for SIE. In parallel, we curate a novel dataset of 629 mutations in ten influenza A virus proteins obtained from 239 scientific publications to serve as ground truth for the mutation extraction task. Finally, we demonstrate VILLA's superior performance using a novel and comprehensive evaluation and comparison with vanilla RAG and other state-of-the-art RAG- and agent-based tools for SIE.
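To make the open-ended extraction task concrete, the sketch below shows one way extracted mutations could be scored against a curated ground truth set. It is not the evaluation used in the paper, which the abstract does not specify. The `PROTEIN-A123B` notation, the `normalize` and `score` helpers, and the set-based precision/recall/F1 metrics are all assumptions made for illustration, and the example mutation strings are illustrative inputs rather than entries from the authors' dataset.

```python
import re

# Assumed notation: an optional protein prefix, then reference residue,
# position, and substituted residue, e.g. "PB2-E627K" or "E627K".
MUTATION_RE = re.compile(
    r"^(?:(?P<protein>[A-Za-z0-9]+)[-: ])?(?P<ref>[A-Z])(?P<pos>\d+)(?P<alt>[A-Z])$"
)


def normalize(mutation: str) -> str:
    """Return a canonical 'PROTEIN:A123B' form (hypothetical helper).

    Strings that do not fit the assumed notation are upper-cased and kept,
    so they can still match an identically written ground truth entry.
    """
    text = mutation.strip()
    match = MUTATION_RE.match(text)
    if match is None:
        return text.upper()
    protein = (match.group("protein") or "").upper()
    core = f"{match.group('ref')}{int(match.group('pos'))}{match.group('alt')}"
    return f"{protein}:{core}" if protein else core


def score(predicted, gold):
    """Set-based precision, recall, and F1 over normalized mutation strings."""
    pred = {normalize(m) for m in predicted}
    truth = {normalize(m) for m in gold}
    true_positives = len(pred & truth)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Illustrative inputs only; not taken from the paper's dataset.
    extracted = ["PB2-E627K", "HA:Q226L", "NP N319K"]
    curated = ["PB2:E627K", "HA-Q226L", "PA-T97I", "M2-S31N"]
    print(score(extracted, curated))
```

Exact matching after normalization is the strictest option. An open-ended task may also call for partial credit, for example when the position is correct but the protein is missing, which this sketch does not attempt.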

Problem

Research questions and friction points this paper is trying to address.

scientific information extraction

large language models

open-ended tasks

virology

mutation extraction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-Augmented Generation (RAG)

Scientific Information Extraction

Open-ended Task