MolViBench: Evaluating LLMs on Molecular Vibe Coding

📅 2026-05-04

🤖 AI Summary
Existing benchmarks do not jointly evaluate the capabilities that molecular programming tasks demand of large language models: programming skill, chemical knowledge, and domain-specific reasoning. To address this gap, this work introduces MolViBench, the first benchmark designed specifically for Molecular Vibe Coding, comprising 358 tasks derived from real-world drug discovery workflows and spanning five cognitive levels. It proposes a multi-layered evaluation framework that combines type-aware output matching with abstract syntax tree (AST)-based semantic fallback analysis to jointly assess code executability and chemical correctness. The work establishes the first systematic taxonomy of molecular programming tasks, bridging general-purpose code generation and chemical domain expertise. Using this benchmark, the authors conduct fine-grained evaluations of nine state-of-the-art coding LLMs and compare three Molecular Vibe Coding paradigms, offering a reliable diagnostic tool for AI-driven molecular discovery.
📝 Abstract
Molecular Vibe Coding, a paradigm where chemists interact with LLMs to generate executable programs for molecular tasks, has emerged as a flexible alternative to chemical agents with predefined tools, enabling chemists to express arbitrarily complex, customized workflows. Unlike general coding tasks, molecular coding poses a distinctive challenge: LLMs must jointly possess programming, molecular understanding, and domain-specific reasoning capabilities. However, existing benchmarks remain disconnected. General code generation benchmarks such as HumanEval and SWE-bench require no chemistry knowledge, while chemistry-focused benchmarks such as S^2-Bench and ChemCoTBench evaluate knowledge recall or property prediction rather than executable code generation. To bridge this gap, we introduce MolViBench, the first benchmark tailored for Molecular Vibe Coding. MolViBench comprises 358 curated tasks across five cognitive levels, ranging from single-API recall to end-to-end virtual screening pipeline design, and spanning 12 real-world drug discovery workflows. To rigorously assess generated code, we also propose a multi-layered evaluation framework that combines type-aware output comparison with AST-based API-semantic fallback analysis, jointly measuring executability and chemical correctness. We systematically evaluate 9 frontier coding LLMs and compare three real-world Molecular Vibe Coding paradigms, providing a practical and fine-grained testbed for diagnosing LLMs' coding capabilities in AI-accelerated molecular discovery.
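The abstract names the two ingredients of the evaluation framework but does not describe how they work. As a rough illustration only, the sketch below shows one way a type-aware output comparison and an AST-based check of API usage could be written with the Python standard library. The function names (`outputs_match`, `called_names`, `api_overlap`), the tolerances, and the overlap score are assumptions made for this example; they are not taken from MolViBench.

```python
import ast
import math


def outputs_match(expected, actual, rel_tol=1e-6):
    """Compare two outputs according to their types (hypothetical helper)."""
    # bool is a subclass of int, so handle it before the numeric branch.
    if isinstance(expected, bool) or isinstance(actual, bool):
        return type(expected) is type(actual) and expected == actual
    # Numbers: tolerate small floating-point differences.
    if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
        return math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=1e-9)
    # Lists and tuples: same length, elements compared recursively.
    if isinstance(expected, (list, tuple)) and isinstance(actual, (list, tuple)):
        return len(expected) == len(actual) and all(
            outputs_match(e, a, rel_tol) for e, a in zip(expected, actual)
        )
    # Dicts: same keys, values compared recursively.
    if isinstance(expected, dict) and isinstance(actual, dict):
        return expected.keys() == actual.keys() and all(
            outputs_match(expected[k], actual[k], rel_tol) for k in expected
        )
    # Everything else (strings, sets, None, ...): same type and equal.
    return type(expected) is type(actual) and expected == actual


def called_names(source):
    """Collect dotted names of calls in the source, e.g. 'Chem.MolFromSmiles'."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            parts = []
            func = node.func
            # Walk attribute chains such as a.b.c from right to left.
            while isinstance(func, ast.Attribute):
                parts.append(func.attr)
                func = func.value
            # Only record chains rooted in a plain name.
            if isinstance(func, ast.Name):
                parts.append(func.id)
                names.add(".".join(reversed(parts)))
    return names


def api_overlap(reference_src, candidate_src):
    """Fraction of the reference solution's calls that the candidate also makes."""
    ref = called_names(reference_src)
    cand = called_names(candidate_src)
    return len(ref & cand) / len(ref) if ref else 1.0
```

In this sketch the source code is only parsed, never executed, so the AST check works even when the generated program cannot run; that is one plausible reading of "fallback". Strings are compared literally, so two different SMILES strings for the same molecule would not match; handling that would require canonicalization with a cheminformatics toolkit, which is left out here.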
Problem

Research questions and friction points this paper is trying to address.

Molecular Vibe Coding
LLM evaluation
code generation
molecular tasks
benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

Molecular Vibe Coding
MolViBench
executable code generation
multi-layered evaluation
AST-based semantic analysis