SUPERChem: A Multimodal Reasoning Benchmark in Chemistry

📅 2025-11-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing chemical reasoning benchmarks suffer from oversimplified tasks, a lack of process-level evaluation, and misalignment with expert-level skills. To address these limitations, the paper introduces SUPERChem, a benchmark of 500 expert-curated, reasoning-intensive problems that span diverse chemistry subfields and are provided in both multimodal and text-only formats. SUPERChem introduces Reasoning Path Fidelity (RPF), a scoring metric that evaluates reasoning quality by comparing model-generated solution paths against expert-authored ones. Original content and an iterative curation pipeline eliminate flawed items and mitigate data contamination. Providing each problem in both formats allows analysis of how visual information affects chemical reasoning, an effect the authors report to be model-dependent. Against a human baseline of 40.3% accuracy, the strongest evaluated model, GPT-5 (High), scores 38.5%, which indicates that the benchmark remains challenging for current models. SUPERChem is designed to support systematic, process-level assessment of expert-level chemical reasoning.

📝 Abstract

Current benchmarks for evaluating the chemical reasoning capabilities of Large Language Models (LLMs) are limited by oversimplified tasks, lack of process-level evaluation, and misalignment with expert-level chemistry skills. To address these issues, we introduce SUPERChem, a benchmark of 500 expert-curated reasoning-intensive chemistry problems, covering diverse subfields and provided in both multimodal and text-only formats. Original content and an iterative curation pipeline eliminate flawed items and mitigate data contamination. Each problem is paired with an expert-authored solution path, enabling Reasoning Path Fidelity (RPF) scoring to evaluate reasoning quality beyond final-answer accuracy. Evaluations against a human baseline of 40.3% accuracy show that even the best-performing model, GPT-5 (High), reaches only 38.5%, followed closely by Gemini 2.5 Pro (37.9%) and DeepSeek-V3.1-Think (37.3%). SUPERChem elicits multi-step, multimodal reasoning, reveals model-dependent effects of visual information, and distinguishes high-fidelity reasoners from heuristic ones. By providing a challenging benchmark and a reliable evaluation framework, SUPERChem aims to facilitate the advancement of LLMs toward expert-level chemical intelligence. The dataset of the benchmark is available at https://huggingface.co/datasets/ZehuaZhao/SUPERChem.
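
The abstract does not spell out how RPF is computed, so the following is only a minimal sketch of the general idea of scoring a reasoning path separately from the final answer. It assumes, purely for illustration, that each expert solution path is split into checkpoints and that some grader has already marked which checkpoints a model's reasoning covers. The `Attempt` record, its fields, and the "fraction of checkpoints covered" rule are hypothetical and are not taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Attempt:
    """One model attempt on one problem (hypothetical record layout)."""
    answer_correct: bool         # final answer matches the answer key
    checkpoints_hit: list[bool]  # one flag per expert solution step (assumed)


def path_fidelity(attempt: Attempt) -> float:
    """Fraction of expert checkpoints covered; an assumed stand-in for RPF."""
    if not attempt.checkpoints_hit:
        return 0.0
    return sum(attempt.checkpoints_hit) / len(attempt.checkpoints_hit)


def summarize(attempts: list[Attempt]) -> tuple[float, float]:
    """Return (final-answer accuracy, mean path fidelity) over all attempts."""
    n = len(attempts)
    accuracy = sum(a.answer_correct for a in attempts) / n
    fidelity = sum(path_fidelity(a) for a in attempts) / n
    return accuracy, fidelity


# Toy data, invented for illustration only.
attempts = [
    Attempt(True, [True, True, True, True]),     # correct and well reasoned
    Attempt(True, [False, False, True, False]),  # correct, but via a shortcut
    Attempt(False, [True, True, False, False]),  # sound start, wrong finish
]
accuracy, fidelity = summarize(attempts)
print(f"accuracy={accuracy:.3f}, mean path fidelity={fidelity:.3f}")
```

In this toy data the second attempt counts fully toward accuracy but contributes little to path fidelity, which is the kind of gap between heuristic and high-fidelity reasoning that a process-level metric is meant to expose.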

Problem

Research questions and friction points this paper is trying to address.

Current chemistry benchmarks lack complexity and expert-level alignment.

Existing evaluations fail to assess reasoning processes beyond final answers.

There is no reliable multimodal benchmark for chemical reasoning quality.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Expert-curated multimodal benchmark for chemistry reasoning

Reasoning Path Fidelity scoring to evaluate solution quality

Iterative curation pipeline to eliminate flawed items

🔎 Similar Papers

ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area