SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work introduces the first foundation model for interdisciplinary scientific reasoning, designed to unify natural language with multimodal scientific representations—including chemical formulas, protein sequences, and crystal structures—and support long-chain reasoning. Methodologically, the model is pretrained on a 206B-token multimodal scientific corpus, followed by cold-start-guided instruction tuning and reinforcement learning with task-specific reward functions to enable robust knowledge transfer and high-fidelity reasoning. Evaluated across 103 diverse scientific tasks, it significantly outperforms domain-specific models in text–scientific format translation, property prediction/classification, and conditional/unconditional sequence generation and design. It further demonstrates superior cross-domain generalization and output fidelity. The code, model checkpoints, and benchmark suite are fully open-sourced.
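The task-specific reward functions are described only at a high level here; the sketch below illustrates one plausible shape for such a reward in conditional molecule generation, assuming RDKit for SMILES validity checking. The function name and the partial-credit weighting are hypothetical illustrations, not the paper's implementation.

```python
# Hypothetical sketch of a task-specific reward for conditional molecule
# generation, in the spirit of the paper's reward-shaped RL stage.
# RDKit is assumed for validity checking; the weights are illustrative.
from rdkit import Chem


def smiles_reward(generated: str, reference: str) -> float:
    """Score a generated SMILES string: 0 if unparsable, partial credit
    for syntactic validity, full credit for matching the reference."""
    mol = Chem.MolFromSmiles(generated)
    if mol is None:
        return 0.0  # invalid SMILES earns nothing
    # Canonicalize both sides so equivalent SMILES spellings compare equal.
    canon_gen = Chem.MolToSmiles(mol)
    ref_mol = Chem.MolFromSmiles(reference)
    canon_ref = Chem.MolToSmiles(ref_mol) if ref_mol else None
    return 1.0 if canon_gen == canon_ref else 0.3  # valid but wrong molecule


print(smiles_reward("C1=CC=CC=C1", "c1ccccc1"))  # both benzene -> 1.0
```

Graded rewards of this kind (validity first, exact match second) are a common way to shape RL for structured sequence outputs, since a binary exact-match signal alone is too sparse to learn from.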

📝 Abstract
We present a scientific reasoning foundation model that aligns natural language with heterogeneous scientific representations. The model is pretrained on a 206B-token corpus spanning scientific text, pure sequences, and sequence-text pairs, then aligned via SFT on 40M instructions, annealed cold-start bootstrapping to elicit long-form chain-of-thought, and reinforcement learning with task-specific reward shaping, which instills deliberate scientific reasoning. It supports four capability families, covering up to 103 tasks across workflows: (i) faithful translation between text and scientific formats, (ii) text/knowledge extraction, (iii) property prediction and classification, and (iv) unconditional and conditional sequence generation and design. Compared with specialist systems, our approach broadens instruction coverage, improves cross-domain generalization, and enhances fidelity. We detail data curation and training and show that cross-discipline learning strengthens transfer and downstream reliability. The model, instruction-tuning datasets, and evaluation code are open-sourced at https://huggingface.co/SciReason and https://github.com/open-sciencelab/SciReason.
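Since the checkpoints are released under https://huggingface.co/SciReason, a minimal query sketch using the standard transformers causal-LM interface is shown below. The repository id is an assumption about how the checkpoint is named under that organization; check the organization page for the actual name.

```python
# Minimal sketch: querying a released checkpoint with Hugging Face
# transformers. The repo id below is a guess at the naming under the
# SciReason organization, not confirmed by this page.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SciReason/SciReasoner"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Translate to SMILES: the simplest aromatic hydrocarbon, benzene."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```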
Problem

Research questions and friction points this paper is trying to address.

Aligning natural language with heterogeneous scientific representations across disciplines
Developing a foundation model for scientific reasoning covering 103 diverse tasks
Enhancing cross-domain generalization and fidelity in scientific workflows
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pretrained on 206B tokens of scientific text and sequences
Aligned via SFT, bootstrapping, and reward-shaped reinforcement learning
Supports translation, extraction, property prediction/classification, and sequence generation across 103 scientific tasks (one possible record shape is sketched below)
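For concreteness, the sketch below shows one plausible shape for an instruction record in the text-to-format translation family. The field names and schema are hypothetical, since this page does not show the released dataset format.

```python
# Hypothetical shape of one instruction-tuning record for the
# text-to-SMILES translation family; field names are illustrative,
# not taken from the released datasets.
record = {
    "task": "text_to_smiles",
    "instruction": "Convert the molecule description into a SMILES string.",
    "input": "Acetic acid: a two-carbon carboxylic acid.",
    "output": "CC(=O)O",
}
print(record["output"])  # CC(=O)O is acetic acid's SMILES
```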
Authors

Yizhou Wang - Shanghai Artificial Intelligence Laboratory, The Chinese University of Hong Kong
Chen Tang - Shanghai Artificial Intelligence Laboratory, The Chinese University of Hong Kong
Han Deng - Houston Methodist Research Institute (Machine Learning)
Jiabei Xiao - Shanghai Artificial Intelligence Laboratory, Shanghai Jiao Tong University
Jiaqi Liu - Shanghai Artificial Intelligence Laboratory
Jianyu Wu - School of Computer Science, Peking University (Open Source Software, Software Engineering, Mining Software Repositories)
Jun Yao - Shanghai Artificial Intelligence Laboratory, University of Science and Technology of China
Pengze Li - Shanghai Artificial Intelligence Laboratory, Fudan University
Encheng Su - Technical University of Munich (medical image, LLM, deep learning)
Lintao Wang - The University of Sydney (character animation, human motion understanding and generation, large language model, AI4Science)
Guohang Zhuang - Shanghai Artificial Intelligence Laboratory
Yuchen Ren - Renmin University of China
Ben Fei - Shanghai Artificial Intelligence Laboratory, The Chinese University of Hong Kong
Ming Hu - Shanghai Artificial Intelligence Laboratory
Xin Chen - Shanghai Artificial Intelligence Laboratory
Dongzhan Zhou - Shanghai AI Lab (AI4Science, computer vision, deep learning)
Junjun He - Shanghai Jiao Tong University
Xiangyu Yue - The Chinese University of Hong Kong / UC Berkeley / Stanford University / NJU (Artificial Intelligence, Computer Vision, Multi-modal Learning)
Zhenfei Yin - University of Oxford (Deep Learning, Multimodal, AI Agent, Robotics)
Jiamin Wu - Shanghai Artificial Intelligence Laboratory, The Chinese University of Hong Kong
Qihao Zheng - Shanghai AI Lab (Neuroscience, NeuroAI, AI4Neuro, AI4Science)
Yuhao Zhou - Shanghai Artificial Intelligence Laboratory
Huihui Xu - Shanghai Artificial Intelligence Laboratory
Chenglong Ma - Fudan University; Shanghai Innovation Institute (multi-modal models, generative models, medical image analysis)
Yan Lu - Shanghai Artificial Intelligence Laboratory, The Chinese University of Hong Kong