🤖 AI Summary
This work introduces the first foundation model for interdisciplinary scientific reasoning, designed to unify natural language with multimodal scientific representations—including chemical formulas, protein sequences, and crystal structures—and to support long-chain reasoning. Methodologically, the model is pretrained on a 206B-token multimodal scientific corpus, followed by cold-start-guided instruction tuning and reinforcement learning with task-specific reward functions to enable robust knowledge transfer and high-fidelity reasoning. Evaluated across 103 diverse scientific tasks, it significantly outperforms domain-specific models in text–scientific format translation, property prediction and classification, and conditional and unconditional sequence generation and design. It further demonstrates superior cross-domain generalization and output fidelity. The code, model checkpoints, and benchmark suite are fully open-sourced.
📝 Abstract
We present a scientific reasoning foundation model that aligns natural language with heterogeneous scientific representations. The model is pretrained on a 206B-token corpus spanning scientific text, pure sequences, and sequence-text pairs, then aligned via SFT on 40M instructions, annealed cold-start bootstrapping to elicit long-form chain-of-thought, and reinforcement learning with task-specific reward shaping, which instills deliberate scientific reasoning. It supports five capability families, covering 103 tasks across scientific workflows: (i) faithful translation between text and scientific formats, (ii) text and knowledge extraction, (iii) property prediction, (iv) property classification, and (v) unconditional and conditional sequence generation and design. Compared with specialist systems, our approach broadens instruction coverage, improves cross-domain generalization, and enhances fidelity. We detail data curation and training, and show that cross-discipline learning strengthens transfer and downstream reliability. The model, instruction-tuning datasets, and evaluation code are open-sourced at https://huggingface.co/SciReason and https://github.com/open-sciencelab/SciReason.
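The abstract's "task-specific reward shaping" can be made concrete with a minimal sketch. The paper does not specify its reward functions; the validity gate, exact-match bonus, and tolerance-based regression reward below are all illustrative assumptions, showing only the general shape of per-task rewards (a validity check composed with a task score) rather than the authors' actual implementation.

```python
# Hypothetical reward-shaping sketch for RL fine-tuning on scientific tasks.
# All weights and checks here are illustrative assumptions, not from the paper.

def smiles_like(s: str) -> bool:
    """Toy validity check: non-empty and drawn from a SMILES-ish alphabet."""
    allowed = set("CNOSPFIBrcl()[]=#@+-1234567890")
    return bool(s) and all(ch in allowed for ch in s)

def translation_reward(pred: str, target: str) -> float:
    """Text-to-format translation: gate on validity, then score the match."""
    if not smiles_like(pred):
        return 0.0                            # invalid outputs earn nothing
    return 1.0 if pred == target else 0.3     # partial credit for valid output

def regression_reward(pred: float, target: float, tol: float = 0.1) -> float:
    """Property prediction: full reward within tolerance, decaying outside."""
    err = abs(pred - target)
    return 1.0 if err <= tol else max(0.0, 1.0 - err)
```

Gating on output validity before scoring correctness is a common design for sequence-generation rewards, since it discourages the policy from collecting partial credit on malformed structures.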