🤖 AI Summary
This work rigorously evaluates how well large language models (LLMs) reason about organic reaction mechanisms: whether they maintain logical coherence across multi-step pathways, generate valid intermediates, and preserve chemical consistency. To this end, we introduce oMeBench, the first large-scale, expert-annotated benchmark (>10,000 mechanism steps) with fine-grained type labels, difficulty ratings, and fully specified intermediates. We further propose oMeS, a dynamic evaluation framework that integrates step-level logical validation with molecular structural similarity for interpretable, fine-grained scoring. Our analysis reveals a critical bottleneck in current LLMs' ability to maintain multi-step logical consistency. Leveraging mechanism-aware prompting and domain-specific fine-tuning, we achieve a 50% performance gain over the leading closed-source model. Key contributions include: (1) the first high-quality, expert-curated mechanism reasoning benchmark; (2) a dual-dimensional automated evaluation paradigm combining logical and structural fidelity; and (3) a reproducible, chemistry-grounded methodology for enhancing mechanistic reasoning capabilities.
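For concreteness, a single oMeBench entry might be organized as in the sketch below, inferred from the annotation scheme described above (type labels, difficulty ratings, fully specified intermediates). The field names, the SMILES encoding, and the 1-5 difficulty scale are our assumptions, not the dataset's actual format.

```python
# Hypothetical sketch of one oMeBench record, inferred from the summary above.
# Field names, SMILES encoding, and the 1-5 difficulty scale are assumptions;
# consult the released dataset for the authoritative schema.
from dataclasses import dataclass, field


@dataclass
class MechanismStep:
    step_type: str     # fine-grained type label, e.g. "nucleophilic addition"
    intermediate: str  # fully specified intermediate as a SMILES string


@dataclass
class MechanismEntry:
    reaction_smiles: str  # overall transformation, "reactants>>products"
    difficulty: int       # expert-assigned rating, assumed 1 (easy) to 5 (hard)
    steps: list[MechanismStep] = field(default_factory=list)


# Toy entry: esterification of acetyl chloride with ethanol.
entry = MechanismEntry(
    reaction_smiles="CC(=O)Cl.CCO>>CC(=O)OCC.Cl",
    difficulty=2,
    steps=[
        MechanismStep("nucleophilic addition", "CC([O-])(Cl)[OH+]CC"),
        MechanismStep("proton transfer", "CC(O)(Cl)OCC"),
        MechanismStep("leaving-group departure", "CC(=O)OCC"),
    ],
)
```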
📝 Abstract
Organic reaction mechanisms are the stepwise elementary reactions by which reactants form intermediates and products, and they are fundamental to understanding chemical reactivity and to designing new molecules and reactions. Although large language models (LLMs) have shown promise on chemical tasks such as synthesis design, it is unclear to what extent this reflects genuine chemical reasoning, i.e., the ability to generate valid intermediates, maintain chemical consistency, and follow logically coherent multi-step pathways. We address this by introducing oMeBench, the first large-scale, expert-curated benchmark for mechanism reasoning in organic chemistry. It comprises over 10,000 annotated mechanistic steps with intermediates, type labels, and difficulty ratings. Furthermore, to evaluate LLM capability more precisely and enable fine-grained scoring, we propose oMeS, a dynamic evaluation framework that combines step-level logical assessment with chemical similarity. We analyze the performance of state-of-the-art LLMs, and our results show that although current models display promising chemical intuition, they struggle with correct and consistent multi-step reasoning. Notably, we find that a tailored prompting strategy combined with fine-tuning a specialist model on our dataset increases performance by 50% over the leading closed-source model. We hope that oMeBench will serve as a rigorous foundation for advancing AI systems toward genuine chemical reasoning.
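To make the dual-dimensional scoring idea concrete, the sketch below blends a step-level logic judgment with a molecular similarity term. It is an illustration only: the Tanimoto similarity over Morgan fingerprints (via RDKit), the placeholder `logic_valid` flag, and the equal weighting are our assumptions, not the actual oMeS metric.

```python
# Minimal sketch of a dual-dimensional step score in the spirit of oMeS.
# Assumptions (not specified by the paper): structural fidelity is measured
# by Tanimoto similarity over Morgan fingerprints via RDKit, the logical
# check is a boolean supplied by an external validator, and the two
# dimensions are combined as a weighted sum.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem


def structural_similarity(pred_smiles: str, gold_smiles: str) -> float:
    """Tanimoto similarity between a predicted and a reference intermediate."""
    pred = Chem.MolFromSmiles(pred_smiles)
    gold = Chem.MolFromSmiles(gold_smiles)
    if pred is None or gold is None:
        return 0.0  # invalid SMILES earns no structural credit
    fp_pred = AllChem.GetMorganFingerprintAsBitVect(pred, 2, nBits=2048)
    fp_gold = AllChem.GetMorganFingerprintAsBitVect(gold, 2, nBits=2048)
    return DataStructs.TanimotoSimilarity(fp_pred, fp_gold)


def step_score(pred_smiles: str, gold_smiles: str,
               logic_valid: bool, w_logic: float = 0.5) -> float:
    """Blend a step-level logic judgment with molecular structural similarity."""
    logic = 1.0 if logic_valid else 0.0
    return w_logic * logic + (1.0 - w_logic) * structural_similarity(
        pred_smiles, gold_smiles)


if __name__ == "__main__":
    # An exact intermediate with a valid logical step scores 1.0; a related
    # but wrong structure with an invalid step gets only partial credit.
    print(step_score("CC(=O)OCC", "CC(=O)OCC", logic_valid=True))
    print(step_score("CC(=O)O", "CC(=O)OCC", logic_valid=False))
```

Separating the two terms is what makes the score interpretable: a pathway can fail on logical coherence while still earning credit for chemically plausible intermediates, and vice versa.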