Towards Artificial Intelligence Research Assistant for Expert-Involved Learning

📅 2025-05-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
The reliability and practical utility of large language models (LLMs) and large multimodal models (LMMs) in biomedical text summarization and figure understanding remain insufficiently evaluated. Method: We introduce ARIEL—the first expert-curated, multimodal benchmark for biomedical research papers—featuring two core tasks: paper abstract generation and biomedical figure reasoning. ARIEL integrates doctoral-level human evaluation, systematic prompt engineering, supervised fine-tuning, and test-time scaling. We further propose an LMM Agent framework for scientific hypothesis generation and design an expert-collaborative evaluation paradigm. Contribution/Results: Our optimized methods significantly outperform human expert–corrected baselines in both summary accuracy and figure reasoning. ARIEL systematically characterizes the capability boundaries of mainstream LLMs/LMMs in biomedical domains, providing a reproducible benchmark and actionable optimization pathways for real-world deployment.

📝 Abstract
Large Language Models (LLMs) and Large Multi-Modal Models (LMMs) have emerged as transformative tools in scientific research, yet their reliability and specific contributions to biomedical applications remain insufficiently characterized. In this study, we present **AR**tificial **I**ntelligence research assistant for **E**xpert-involved **L**earning (ARIEL), a multimodal dataset designed to benchmark and enhance two critical capabilities of LLMs and LMMs in biomedical research: summarizing extensive scientific texts and interpreting complex biomedical figures. To facilitate rigorous assessment, we create two open-source sets comprising biomedical articles and figures with designed questions. We systematically benchmark both open- and closed-source foundation models, incorporating expert-driven human evaluations conducted by doctoral-level experts. Furthermore, we improve model performance through targeted prompt engineering and fine-tuning strategies for summarizing research papers, and apply test-time computational scaling to enhance the reasoning capabilities of LMMs, achieving superior accuracy compared to human-expert corrections. We also explore the potential of using LMM Agents to generate scientific hypotheses from diverse multimodal inputs. Overall, our results delineate clear strengths and highlight significant limitations of current foundation models, providing actionable insights and guiding future advancements in deploying large-scale language and multi-modal models within biomedical research.
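The "test-time computational scaling" mentioned in the abstract is typically realized by sampling a model several times and aggregating the answers. A minimal sketch of one common aggregation rule, self-consistency majority voting, is below; `query_lmm` is a hypothetical stand-in for a real LMM API call, and the paper's exact scaling strategy may differ.

```python
from collections import Counter

def self_consistency_vote(answers):
    """Return the most frequent answer among sampled model outputs.

    This is one common test-time scaling rule (majority voting over
    repeated samples); it is an illustration, not ARIEL's exact method.
    """
    counts = Counter(answers)
    best_answer, _ = counts.most_common(1)[0]
    return best_answer

def query_lmm(question, figure, seed):
    """Hypothetical placeholder for an LMM call on a figure-reasoning question.

    A real implementation would send `figure` and `question` to a model
    with sampling enabled; here we return canned answers for illustration.
    """
    canned = ["mitochondria", "mitochondria", "nucleus"]
    return canned[seed % len(canned)]

# Sample the (stubbed) model five times, then vote.
answers = [query_lmm("Which organelle is highlighted?", None, s) for s in range(5)]
print(self_consistency_vote(answers))  # → mitochondria
```

The intuition is that individual samples are noisy, but correct answers recur more often than any single wrong one, so the vote is more reliable than one greedy decode.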
Problem

Research questions and friction points this paper is trying to address.

Assessing reliability of LLMs/LMMs in biomedical applications
Enhancing text summarization and figure interpretation in biomedicine
Benchmarking models with expert evaluations and improving performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal dataset for biomedical text and figure analysis
Expert-driven human evaluations for model benchmarking
Prompt engineering and fine-tuning for performance improvement
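The prompt-engineering contribution above amounts to giving the model a structured instruction before the paper text. A minimal sketch of such a prompt builder follows; the instruction wording and the `max_words` parameter are illustrative assumptions, not the paper's actual prompt.

```python
def build_summary_prompt(paper_text, max_words=250):
    """Assemble a structured abstract-generation prompt.

    The role line, section checklist, and word limit are hypothetical
    examples of the kind of constraints systematic prompt engineering adds.
    """
    return (
        "You are a biomedical research assistant.\n"
        f"Summarize the paper below as an abstract of at most {max_words} words.\n"
        "Cover: background, methods, key results, and significance.\n\n"
        f"Paper:\n{paper_text}"
    )

prompt = build_summary_prompt("Large language models have emerged as ...")
print(prompt.splitlines()[0])  # → You are a biomedical research assistant.
```

In practice the returned string would be sent as the user message to whichever LLM is being benchmarked, and variants of the checklist would be compared against expert ratings.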
👥 Authors

Tianyu Liu (Yale University)
Simeng Han (Google DeepMind)
Xiao Luo (University of California, Los Angeles)
Hanchen Wang (Genentech)
Pan Lu (Stanford University)
Biqing Zhu (Yale University)
Yuge Wang (Yale University)
Keyi Li (Yale University)
Jiapeng Chen (Yale University)
Rihao Qu (Yale University)
Yufeng Liu (Yale University)
Xinyue Cui (University of Southern California)
Aviv Yaish (Yale University, IC3)
Yuhang Chen (Yale University)
Minsheng Hao (Tsinghua University)
Chuhan Li (Yale University)
Kexing Li (Yale University)
Arman Cohan (Yale University; Allen Institute for AI)
Hua Xu (Yale University)
Mark Gerstein (Professor of Biomedical Informatics, Yale University)
James Zou (Stanford University)
Hongyu Zhao (Yale University)