CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning

📅 2025-03-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the gap in evaluating large language models' (LLMs) multi-task, long-context scientific reasoning capabilities in authentic research settings. To this end, we introduce CURIE, an interdisciplinary, long-context scientific reasoning benchmark spanning six disciplines (materials science, condensed matter physics, quantum computing, geospatial analysis, biodiversity, and proteins) with ten tasks comprising 580 expert-curated problem-and-solution pairs. CURIE assesses LLMs' ability to integrate domain-specific knowledge, comprehend long in-context information, and perform multi-step reasoning across both experimental and theoretical workflows. We evaluate state-of-the-art closed and open models, including Gemini Flash 2.0, Claude-3, GPT-4o, and command-R+, using expert annotations, multi-granularity scoring, and a reproducible evaluation framework. Results reveal that even the best-performing model reaches only 32% overall, with pronounced failures on protein sequence tasks for GPT-4o and command-R+. Gemini Flash 2.0 and Claude-3 show the most consistent comprehension across domains. These findings underscore the need for domain-grounded, scientifically rigorous benchmarks to accurately characterize LLM capabilities in real-world research contexts.

📝 Abstract
Scientific problem-solving involves synthesizing information while applying expert knowledge. We introduce CURIE, a scientific long-Context Understanding, Reasoning and Information Extraction benchmark to measure the potential of Large Language Models (LLMs) in scientific problem-solving and in assisting scientists in realistic workflows. This benchmark introduces ten challenging tasks with a total of 580 problem-and-solution pairs curated by experts in six disciplines - materials science, condensed matter physics, quantum computing, geospatial analysis, biodiversity, and proteins - covering both experimental and theoretical workflows in science. We evaluate a range of closed and open LLMs on tasks in CURIE, which require domain expertise, comprehension of long in-context information, and multi-step reasoning. While Gemini Flash 2.0 and Claude-3 show consistently high comprehension across domains, the popular GPT-4o and command-R+ fail dramatically on protein sequencing tasks. With the best performance at 32%, there is much room for improvement for all models. We hope that insights gained from CURIE can guide the future development of LLMs in the sciences. Evaluation code and data are available at https://github.com/google/curie
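To make the evaluation setup concrete, here is a minimal, hypothetical sketch of a long-context evaluation loop in the spirit of CURIE. The file layout, the record fields (`context`, `question`, `answer`), and the `query_model` stub are illustrative assumptions and do not reflect the actual interface in the google/curie repository, which uses task-specific, finer-grained scoring rather than exact match.

```python
# Hypothetical sketch of a long-context benchmark evaluation loop.
# Data layout and field names are assumptions, not the google/curie interface.
import json
from pathlib import Path


def query_model(prompt: str) -> str:
    """Placeholder for a call to an LLM API (e.g., Gemini, Claude, GPT-4o)."""
    raise NotImplementedError("Wire this up to your model client of choice.")


def score_answer(prediction: str, reference: str) -> float:
    """Toy exact-match score; CURIE's metrics are task-specific and finer-grained."""
    return float(prediction.strip().lower() == reference.strip().lower())


def evaluate(task_dir: Path) -> float:
    """Run the model over every problem in one task directory and average the scores."""
    scores = []
    for problem_file in sorted(task_dir.glob("*.json")):
        record = json.loads(problem_file.read_text())
        # Each record is assumed to pair a long scientific context with a question
        # and an expert-curated ground-truth answer.
        prompt = f"{record['context']}\n\nQuestion: {record['question']}"
        prediction = query_model(prompt)
        scores.append(score_answer(prediction, record["answer"]))
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    print(f"Mean score: {evaluate(Path('tasks/materials_science')):.3f}")
```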
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on scientific long-context understanding and reasoning.
Measuring LLMs' potential in scientific problem-solving workflows.
Assessing LLMs' performance on domain-specific, multi-step reasoning tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

CURIE benchmark for scientific LLM evaluation
Ten tasks across six scientific disciplines
Evaluates LLMs on long-context comprehension and multi-step reasoning