🤖 AI Summary
To help large language models (LLMs) keep pace with the explosive growth of scientific literature, this paper introduces SciRIFF, the first multi-disciplinary instruction-tuning dataset focused on research-literature tasks, comprising 137K instruction-following demonstrations across 54 tasks. It targets five core scientific literature understanding capabilities: information extraction, summarization, question answering, claim verification, and classification, and is notable for long input contexts, detailed task specifications, and complex structured outputs. The authors adapt a general instruction-following model to science with a sample-efficient strategy: additional finetuning on a mix of general-domain and SciRIFF demonstrations. The resulting model, SciTulu, yields substantial gains on nine held-out scientific tasks (+28.1% at 7B, +6.5% at 70B) over a strong LLM baseline, while keeping general instruction-following performance within 2% of the baseline. The dataset, model checkpoints, and data processing and evaluation code are all released openly.
📝 Abstract
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks covering five essential scientific literature understanding capabilities: information extraction, summarization, question answering, claim verification, and classification. SciRIFF demonstrations are notable for their long input contexts, detailed task specifications, and complex structured outputs. While instruction-following resources are available in specific domains such as clinical medicine and chemistry, SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields. To demonstrate the utility of SciRIFF, we develop a sample-efficient strategy to adapt a general instruction-following model for science by performing additional finetuning on a mix of general-domain and SciRIFF demonstrations. In evaluations on nine held-out scientific tasks, our model -- called SciTulu -- improves over a strong LLM baseline by 28.1% and 6.5% at the 7B and 70B scales respectively, while maintaining general instruction-following performance within 2% of the baseline. We are optimistic that SciRIFF will facilitate the development and evaluation of LLMs to help researchers navigate the ever-growing body of scientific literature. We release our dataset, model checkpoints, and data processing and evaluation code to enable further research.
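The "mix of general-domain and SciRIFF demonstrations" strategy described above can be sketched in a few lines. This is a minimal illustration of the data-mixing idea only: the function name, the 50/50 default ratio, and the sampling scheme are assumptions for exposition, not the actual proportions or pipeline used in the paper.

```python
import random

def mix_demonstrations(general, scientific, science_fraction=0.5,
                       total=None, seed=0):
    """Sample a combined finetuning set from general-domain and
    science-focused instruction demonstrations.

    The ratio and uniform sampling here are illustrative assumptions;
    the paper's actual mixture may be tuned differently.
    """
    rng = random.Random(seed)
    if total is None:
        total = 2 * min(len(general), len(scientific))
    n_sci = int(total * science_fraction)
    n_gen = total - n_sci
    # Draw without replacement from each pool, then shuffle the union
    # so training batches interleave both sources.
    mix = (rng.sample(scientific, min(n_sci, len(scientific)))
           + rng.sample(general, min(n_gen, len(general))))
    rng.shuffle(mix)
    return mix
```

Keeping general-domain demonstrations in the mix is what lets the adapted model retain most of its baseline instruction-following ability while gaining on scientific tasks.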