🤖 AI Summary
This work rigorously evaluates how well large language models (LLMs) reason about organic reaction mechanisms: whether they maintain logical coherence across multi-step pathways, generate valid intermediates, and preserve chemical consistency. To this end, we introduce oMeBench, the first large-scale, expert-annotated benchmark (>10,000 mechanism steps) with fine-grained type labels, difficulty ratings, and fully specified intermediates. We further propose oMeS, a dynamic evaluation framework that integrates step-level logical validation with molecular structural similarity for interpretable, fine-grained scoring. Our analysis reveals a critical bottleneck in current LLMs' ability to maintain multi-step logical consistency. Leveraging mechanism-aware prompting and domain-specific fine-tuning, we achieve a 50% performance gain over the leading closed-source model. Key contributions include: (1) the first high-quality, expert-curated mechanism reasoning benchmark; (2) a dual-dimensional automated evaluation paradigm combining logical and structural fidelity; and (3) a reproducible, chemistry-grounded methodology for enhancing mechanistic reasoning capabilities.
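For concreteness, a single oMeBench entry might be organized as in the sketch below, inferred from the annotation scheme described above (type labels, difficulty ratings, fully specified intermediates). The field names, the SMILES encoding, and the 1-5 difficulty scale are our assumptions, not the dataset's actual format.

```python
# Hypothetical sketch of one oMeBench record, inferred from the summary above.
# Field names, SMILES encoding, and the 1-5 difficulty scale are assumptions;
# consult the released dataset for the authoritative schema.
from dataclasses import dataclass, field


@dataclass
class MechanismStep:
    step_type: str     # fine-grained type label, e.g. "nucleophilic addition"
    intermediate: str  # fully specified intermediate as a SMILES string


@dataclass
class MechanismEntry:
    reaction_smiles: str  # overall transformation, "reactants>>products"
    difficulty: int       # expert-assigned rating, assumed 1 (easy) to 5 (hard)
    steps: list[MechanismStep] = field(default_factory=list)


# Toy entry: esterification of acetyl chloride with ethanol.
entry = MechanismEntry(
    reaction_smiles="CC(=O)Cl.CCO>>CC(=O)OCC.Cl",
    difficulty=2,
    steps=[
        MechanismStep("nucleophilic addition", "CC([O-])(Cl)[OH+]CC"),
        MechanismStep("proton transfer", "CC(O)(Cl)OCC"),
        MechanismStep("leaving-group departure", "CC(=O)OCC"),
    ],
)
```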
📝 Abstract
Organic reaction mechanisms are the stepwise elementary reactions by which reactants form intermediates and products, and they are fundamental to understanding chemical reactivity and to designing new molecules and reactions. Although large language models (LLMs) have shown promise on chemical tasks such as synthesis design, it is unclear to what extent this reflects genuine chemical reasoning, i.e., the ability to generate valid intermediates, maintain chemical consistency, and follow logically coherent multi-step pathways. We address this by introducing oMeBench, the first large-scale, expert-curated benchmark for mechanism reasoning in organic chemistry. It comprises over 10,000 annotated mechanistic steps with intermediates, type labels, and difficulty ratings. Furthermore, to evaluate LLM capability more precisely and enable fine-grained scoring, we propose oMeS, a dynamic evaluation framework that combines step-level logical assessment with chemical similarity. We analyze the performance of state-of-the-art LLMs, and our results show that although current models display promising chemical intuition, they struggle with correct and consistent multi-step reasoning. Notably, we find that a tailored prompting strategy combined with fine-tuning a specialist model on our dataset increases performance by 50% over the leading closed-source model. We hope that oMeBench will serve as a rigorous foundation for advancing AI systems toward genuine chemical reasoning.
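To make the dual-dimensional scoring idea concrete, the sketch below blends a step-level logic judgment with a molecular similarity term. It is an illustration only: the Tanimoto similarity over Morgan fingerprints (via RDKit), the placeholder `logic_valid` flag, and the equal weighting are our assumptions, not the actual oMeS metric.

```python
# Minimal sketch of a dual-dimensional step score in the spirit of oMeS.
# Assumptions (not specified by the paper): structural fidelity is measured
# by Tanimoto similarity over Morgan fingerprints via RDKit, the logical
# check is a boolean supplied by an external validator, and the two
# dimensions are combined as a weighted sum.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem


def structural_similarity(pred_smiles: str, gold_smiles: str) -> float:
    """Tanimoto similarity between a predicted and a reference intermediate."""
    pred = Chem.MolFromSmiles(pred_smiles)
    gold = Chem.MolFromSmiles(gold_smiles)
    if pred is None or gold is None:
        return 0.0  # invalid SMILES earns no structural credit
    fp_pred = AllChem.GetMorganFingerprintAsBitVect(pred, 2, nBits=2048)
    fp_gold = AllChem.GetMorganFingerprintAsBitVect(gold, 2, nBits=2048)
    return DataStructs.TanimotoSimilarity(fp_pred, fp_gold)


def step_score(pred_smiles: str, gold_smiles: str,
               logic_valid: bool, w_logic: float = 0.5) -> float:
    """Blend a step-level logic judgment with molecular structural similarity."""
    logic = 1.0 if logic_valid else 0.0
    return w_logic * logic + (1.0 - w_logic) * structural_similarity(
        pred_smiles, gold_smiles)


if __name__ == "__main__":
    # An exact intermediate with a valid logical step scores 1.0; a related
    # but wrong structure with an invalid step gets only partial credit.
    print(step_score("CC(=O)OCC", "CC(=O)OCC", logic_valid=True))
    print(step_score("CC(=O)O", "CC(=O)OCC", logic_valid=False))
```

Separating the two terms is what makes the score interpretable: a pathway can fail on logical coherence while still earning credit for chemically plausible intermediates, and vice versa.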