HiSciBench: A Hierarchical Multi-disciplinary Benchmark for Scientific Intelligence from Reading to Discovery

📅 2025-12-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing scientific AI benchmarks are highly fragmented, focusing on narrow-domain tasks and failing to capture the hierarchical, interdisciplinary nature of authentic scientific inquiry. Method: We introduce HiSciBench—the first five-level, end-to-end scientific benchmark covering scientific literacy, literature parsing, literature-based question answering, literature review generation, and scientific discovery—spanning mathematics, physics, chemistry, biology, geography, and astronomy, and supporting multimodal (text, equations, figures, tables) and multilingual inputs. Its dependency-aware hierarchical evaluation framework systematically models capability evolution and inter-stage coupling across the phases of scientific reasoning. Contribution/Results: Built via multimodal data construction, cross-disciplinary knowledge alignment, and structured human annotation, HiSciBench reveals substantial capability decay (69% → 25%) in state-of-the-art models (e.g., GPT-5, DeepSeek-R1), establishing a reproducible, diagnosable, quantitative measure of scientific intelligence.

Technology Category

Application Category

📝 Abstract
The rapid advancement of large language models (LLMs) and multimodal foundation models has sparked growing interest in their potential for scientific research. However, scientific intelligence encompasses a broad spectrum of abilities ranging from understanding fundamental knowledge to conducting creative discovery, and existing benchmarks remain fragmented. Most focus on narrow tasks and fail to reflect the hierarchical and multi-disciplinary nature of real scientific inquiry. We introduce HiSciBench, a hierarchical benchmark designed to evaluate foundation models across five levels that mirror the complete scientific workflow: Scientific Literacy (L1), Literature Parsing (L2), Literature-based Question Answering (L3), Literature Review Generation (L4), and Scientific Discovery (L5). HiSciBench contains 8,735 carefully curated instances spanning six major scientific disciplines—mathematics, physics, chemistry, biology, geography, and astronomy—and supports multimodal inputs including text, equations, figures, and tables, as well as cross-lingual evaluation. Unlike prior benchmarks that assess isolated abilities, HiSciBench provides an integrated, dependency-aware framework that enables detailed diagnosis of model capabilities across different stages of scientific reasoning. Comprehensive evaluations of leading models, including GPT-5, DeepSeek-R1, and several multimodal systems, reveal substantial performance gaps: while models achieve up to 69% accuracy on basic literacy tasks, performance declines sharply to 25% on discovery-level challenges. HiSciBench establishes a new standard for evaluating scientific intelligence and offers actionable insights for developing models that are not only more capable but also more reliable. The benchmark will be publicly released to facilitate future research.
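The abstract does not spell out how the dependency-aware framework aggregates scores, so the following is only a minimal sketch: the level names follow the abstract, but the scoring and weighting scheme are assumptions, not the authors' implementation.

```python
from statistics import mean

# Level names follow the abstract; everything else in this sketch
# (scoring scheme, weighting) is an assumption, not the paper's method.
LEVELS = ["L1_literacy", "L2_parsing", "L3_qa", "L4_review", "L5_discovery"]

def level_scores(results: dict[str, list[float]]) -> dict[str, float]:
    """Mean accuracy per level from per-instance scores in [0, 1]."""
    return {lvl: mean(results[lvl]) for lvl in LEVELS}

def capability_decay(scores: dict[str, float]) -> float:
    """Drop from basic literacy (L1) to discovery (L5); the abstract
    reports roughly 0.69 -> 0.25 for the strongest models."""
    return scores["L1_literacy"] - scores["L5_discovery"]

def dependency_weighted(scores: dict[str, float]) -> float:
    """Assumed dependency-aware aggregate: each level is discounted by
    the product of the scores of the levels it builds on, so weakness
    in early stages propagates to later ones."""
    total, prefix = 0.0, 1.0
    for lvl in LEVELS:
        total += prefix * scores[lvl]
        prefix *= scores[lvl]
    return total / len(LEVELS)
```

Under this assumed scheme, a model that reads well but discovers poorly scores markedly below one with the same average accuracy spread evenly across levels, which matches the diagnostic intent described above.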
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks are fragmented and fail to reflect the hierarchical nature of scientific inquiry.
Most benchmarks focus on narrow tasks and lack multi-disciplinary coverage.
Current evaluations do not assess models across the complete scientific workflow from reading to discovery.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical benchmark covering five scientific workflow levels
Multidisciplinary dataset with multimodal and cross-lingual support (one possible instance layout is sketched after this list)
Integrated dependency-aware framework for capability diagnosis
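To make the dataset description concrete, here is an illustrative record layout consistent with the abstract (8,735 instances, six disciplines, four modalities, cross-lingual inputs). All field names are assumptions; the released schema may differ.

```python
from dataclasses import dataclass, field

@dataclass
class HiSciBenchInstance:
    """Illustrative record layout; all field names are assumptions,
    not the released schema."""
    instance_id: str
    level: str                       # "L1" .. "L5", the five workflow levels
    discipline: str                  # mathematics, physics, chemistry,
                                     # biology, geography, or astronomy
    language: str                    # cross-lingual evaluation, e.g. "en", "zh"
    question: str
    reference_answer: str
    modalities: list[str] = field(default_factory=list)   # text, equations, figures, tables
    source_paper: str | None = None  # for literature-grounded levels (L2-L4)
```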
👥 Authors
Yaping Zhang
Institute of Automation, Chinese Academy of Sciences; University of the Chinese Academy of Sciences
Qixuan Zhang
Institute of Automation, Chinese Academy of Sciences; University of the Chinese Academy of Sciences
Xingquan Zhang
Institute of Automation, Chinese Academy of Sciences; University of the Chinese Academy of Sciences
Zhiyuan Chen
Institute of Automation, Chinese Academy of Sciences; University of the Chinese Academy of Sciences
Wenwen Zhuang
Institute of Automation, Chinese Academy of Sciences
Natural Language Processing, Artificial Intelligence, Deep Learning
Yupu Liang
Institute of Automation, Chinese Academy of Sciences; University of the Chinese Academy of Sciences
Lu Xiang
Institute of Automation, Chinese Academy of Sciences
Dialogue Systems, NLP
Yang Zhao
Institute of Automation, Chinese Academy of Sciences; University of the Chinese Academy of Sciences
Jiajun Zhang
Institute of Automation, Chinese Academy of Sciences
Natural Language Processing, Large Language Models, Multimodal Information Processing
Yu Zhou
Institute of Automation, Chinese Academy of Sciences; University of the Chinese Academy of Sciences
Chengqing Zong
Institute of Automation, Chinese Academy of Sciences; University of the Chinese Academy of Sciences