🤖 AI Summary
Problem: The reliability and practical utility of large language models (LLMs) and large multimodal models (LMMs) for biomedical text summarization and figure understanding remain insufficiently evaluated.
Method: We introduce ARIEL, the first expert-curated multimodal benchmark for biomedical research papers, built around two core tasks: paper abstract generation and biomedical figure reasoning. ARIEL integrates doctoral-level human evaluation, systematic prompt engineering, supervised fine-tuning, and test-time scaling. We further propose an LMM Agent framework for scientific hypothesis generation and design an expert-collaborative evaluation paradigm.
Contribution/Results: Our optimized methods significantly outperform human-expert-corrected baselines in both summary accuracy and figure reasoning. ARIEL systematically characterizes the capability boundaries of mainstream LLMs and LMMs in the biomedical domain, providing a reproducible benchmark and actionable optimization pathways for real-world deployment.
📝 Abstract
Large Language Models (LLMs) and Large Multi-Modal Models (LMMs) have emerged as transformative tools in scientific research, yet their reliability and specific contributions to biomedical applications remain insufficiently characterized. In this study, we present the **AR**tificial **I**ntelligence research assistant for **E**xpert-involved **L**earning (ARIEL), a multimodal dataset designed to benchmark and enhance two critical capabilities of LLMs and LMMs in biomedical research: summarizing extensive scientific texts and interpreting complex biomedical figures. To facilitate rigorous assessment, we create two open-source sets comprising biomedical articles and figures, each paired with expert-designed questions. We systematically benchmark both open- and closed-source foundation models, incorporating human evaluations conducted by doctoral-level experts. Furthermore, we improve model performance through targeted prompt engineering and fine-tuning strategies for summarizing research papers, and apply test-time computational scaling to enhance the reasoning capabilities of LMMs, achieving accuracy superior to human-expert corrections. We also explore the potential of using LMM Agents to generate scientific hypotheses from diverse multimodal inputs. Overall, our results delineate clear strengths and highlight significant limitations of current foundation models, providing actionable insights that can guide future deployment of large-scale language and multi-modal models in biomedical research.