SUPERChem: A Multimodal Reasoning Benchmark in Chemistry

📅 2025-11-30
🤖 AI Summary
Existing chemical reasoning benchmarks suffer from oversimplified tasks, inadequate process-level evaluation, and misalignment with expert-level capabilities. To address these limitations, the authors introduce SUPERChem, a multimodal benchmark of 500 expert-curated, reasoning-intensive problems spanning multiple chemistry subfields. SUPERChem introduces Reasoning Path Fidelity (RPF), a scoring metric that assesses reasoning quality by comparing model-generated solution paths against expert-authored solution paths. Original content and an iterative curation pipeline eliminate flawed items and mitigate data contamination. Because every problem is provided in both multimodal and text-only form, the benchmark supports a human–machine comparison and an analysis of how visual information affects chemical reasoning. The human baseline accuracy is 40.3%, while the strongest evaluated model, GPT-5 (High), scores 38.5%, which indicates how difficult the benchmark is. SUPERChem is presented as the first benchmark to enable systematic, quantitative assessment of expert-level chemical reasoning processes.

📝 Abstract
Current benchmarks for evaluating the chemical reasoning capabilities of Large Language Models (LLMs) are limited by oversimplified tasks, lack of process-level evaluation, and misalignment with expert-level chemistry skills. To address these issues, we introduce SUPERChem, a benchmark of 500 expert-curated reasoning-intensive chemistry problems, covering diverse subfields and provided in both multimodal and text-only formats. Original content and an iterative curation pipeline eliminate flawed items and mitigate data contamination. Each problem is paired with an expert-authored solution path, enabling Reasoning Path Fidelity (RPF) scoring to evaluate reasoning quality beyond final-answer accuracy. Evaluations against a human baseline of 40.3% accuracy show that even the best-performing model, GPT-5 (High), reaches only 38.5%, followed closely by Gemini 2.5 Pro (37.9%) and DeepSeek-V3.1-Think (37.3%). SUPERChem elicits multi-step, multimodal reasoning, reveals model-dependent effects of visual information, and distinguishes high-fidelity reasoners from heuristic ones. By providing a challenging benchmark and a reliable evaluation framework, SUPERChem aims to facilitate the advancement of LLMs toward expert-level chemical intelligence. The dataset of the benchmark is available at https://huggingface.co/datasets/ZehuaZhao/SUPERChem.

Problem

Research questions and friction points this paper is trying to address.

Current chemistry benchmarks lack complexity and expert-level alignment.
Existing evaluations fail to assess reasoning processes beyond final answers.
There is no reliable multimodal benchmark for chemical reasoning quality.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Expert-curated multimodal benchmark for chemistry reasoning
Reasoning Path Fidelity scoring to evaluate solution quality
Iterative curation pipeline to eliminate flawed items
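
To make the idea of process-level scoring concrete, here is a minimal Python sketch of a path-fidelity style score. This page does not say how RPF is actually computed, so everything below is an assumption made for illustration: the `Checkpoint` structure, the keyword-matching rule, and the function names `checkpoint_hit` and `path_fidelity` are hypothetical and are not the authors' method. The sketch only shows the general shape of comparing a model's reasoning against an expert-authored solution path, rather than grading the final answer alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """One step of an expert-authored solution path (hypothetical structure)."""
    description: str
    keywords: tuple  # phrases that must all appear for the step to count as covered


def checkpoint_hit(checkpoint: Checkpoint, model_reasoning: str) -> bool:
    """Toy matching rule (an assumption): case-insensitive keyword containment."""
    text = model_reasoning.lower()
    return all(keyword.lower() in text for keyword in checkpoint.keywords)


def path_fidelity(checkpoints, model_reasoning: str) -> float:
    """Fraction of expert checkpoints that the model's reasoning covers."""
    if not checkpoints:
        raise ValueError("expert path must contain at least one checkpoint")
    hits = sum(checkpoint_hit(c, model_reasoning) for c in checkpoints)
    return hits / len(checkpoints)


# Invented example: a three-step expert path for a substitution problem.
expert_path = [
    Checkpoint("Identify the mechanism", ("sn2",)),
    Checkpoint("Recognise the stereochemical outcome", ("inversion",)),
    Checkpoint("Justify via geometry of attack", ("backside attack",)),
]
model_reasoning = (
    "The substrate is a primary alkyl halide, so the mechanism is SN2. "
    "Substitution therefore proceeds with inversion at the stereocentre."
)

score = path_fidelity(expert_path, model_reasoning)
print(f"{score:.2f}")  # prints 0.67 (two of three checkpoints covered)
```

A score like this separates a model that reaches the right answer through the expert's steps from one that gets there heuristically, which is the distinction the benchmark aims to draw. A real implementation would need a far more robust matcher than keyword containment.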
Zehua Zhao
College of Chemistry and Molecular Engineering, Peking University
Zhixian Huang
College of Chemistry and Molecular Engineering, Peking University
Junren Li
College of Chemistry and Molecular Engineering, Peking University
Siyu Lin
Beijing Jiaotong University
Wireless communications
Junting Zhou
Peking University
Large Language Model, AI for Science, Bioinformatics
Fengqi Cao
Chemy Group
Kun Zhou
TAL Education Group
Rui Ge
Shanghai Jiao Tong University
Tingting Long
Central South University
Mobile computing and edge intelligence
Yuexiang Zhu
College of Chemistry and Molecular Engineering, Peking University
Yan Liu
College of Chemistry and Molecular Engineering, Peking University
Jie Zheng
College of Chemistry and Molecular Engineering, Peking University
Junnian Wei
College of Chemistry and Molecular Engineering, Peking University
Rong Zhu
College of Chemistry and Molecular Engineering, Peking University
Peng Zou
College of Chemistry and Molecular Engineering, Peking University
Wenyu Li
Harbin Institute of Technology
Computer Vision
Zekai Cheng
College of Chemistry and Molecular Engineering, Peking University
Tian Ding
Shenzhen Research Institute of Big Data
Yaxuan Wang
PhD Student of Computer Science, University of California, Santa Cruz
Machine learning
Yizhao Yan
College of Chemistry and Molecular Engineering, Peking University
Tingru Wei
College of Chemistry and Molecular Engineering, Peking University
Haowei Ming
College of Chemistry and Molecular Engineering, Peking University
Weijie Mao
College of Chemistry and Molecular Engineering, Peking University
Chen Sun
College of Chemistry and Molecular Engineering, Peking University
Yiming Liu
College of Chemistry and Molecular Engineering, Peking University