MolViBench: Evaluating LLMs on Molecular Vibe Coding

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing benchmarks do not jointly evaluate the capabilities that molecular programming tasks demand of large language models: programming skill, chemical knowledge, and domain-specific reasoning. To address this gap, this work introduces MolViBench, the first benchmark designed specifically for Molecular Vibe Coding, comprising 358 tasks that are derived from real-world drug discovery workflows and span five cognitive levels. We propose a multi-layered evaluation framework that combines type-aware output comparison with abstract syntax tree (AST)-based API-semantic fallback analysis to jointly assess code executability and chemical correctness. This study establishes the first systematic taxonomy of molecular programming tasks, bridging general-purpose code generation and chemical domain expertise. Using this benchmark, we conduct fine-grained evaluations of nine state-of-the-art coding LLMs across three Molecular Vibe Coding paradigms, offering a reliable diagnostic tool for AI-driven molecular discovery.

📝 Abstract

Molecular Vibe Coding, a paradigm where chemists interact with LLMs to generate executable programs for molecular tasks, has emerged as a flexible alternative to chemical agents with predefined tools, enabling chemists to express arbitrarily complex, customized workflows. Unlike general coding tasks, molecular coding poses a distinctive challenge: LLMs must jointly possess programming, molecular understanding, and domain-specific reasoning capabilities. However, existing benchmarks remain disconnected. General code generation benchmarks such as HumanEval and SWE-bench require no chemistry knowledge, while chemistry-focused benchmarks such as S^2-Bench and ChemCoTBench evaluate knowledge recall or property prediction rather than executable code generation. To bridge this gap, we introduce MolViBench, the first benchmark tailored for Molecular Vibe Coding. MolViBench comprises 358 curated tasks across five cognitive levels, ranging from single-API recall to end-to-end virtual screening pipeline design, spanning 12 real-world drug discovery workflows. To rigorously assess generated code, we also propose a multi-layered evaluation framework that combines type-aware output comparison with AST-based API-semantic fallback analysis to jointly measure executability and chemical correctness. We systematically evaluate 9 frontier coding LLMs and compare three real-world Molecular Vibe Coding paradigms, providing a practical and fine-grained testbed for diagnosing LLMs' coding capabilities in AI-accelerated molecular discovery.
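The abstract names the two evaluation layers but not how they are implemented. The following is a minimal sketch of one way such a scheme could look, not the MolViBench implementation: the function names (`outputs_match`, `called_names`, `api_overlap`), the numeric tolerance, and the use of bare call names as a stand-in for "API semantics" are all assumptions made for illustration. It uses only the Python standard library; the RDKit calls appear only inside strings that are parsed, never executed.

```python
import ast
import math


def outputs_match(expected, actual, rel_tol=1e-6):
    """Type-aware comparison (assumed rules): numbers by tolerance,
    sequences and dicts element by element, everything else by equality."""
    # Treat bools separately so that True does not silently equal 1.
    if isinstance(expected, bool) or isinstance(actual, bool):
        return expected is actual
    if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
        return math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=1e-9)
    if isinstance(expected, (list, tuple)) and isinstance(actual, (list, tuple)):
        return len(expected) == len(actual) and all(
            outputs_match(e, a, rel_tol) for e, a in zip(expected, actual)
        )
    if isinstance(expected, dict) and isinstance(actual, dict):
        return expected.keys() == actual.keys() and all(
            outputs_match(expected[k], actual[k], rel_tol) for k in expected
        )
    return expected == actual


def called_names(source):
    """Collect the final name of every call in the source, e.g.
    'MolFromSmiles' for Chem.MolFromSmiles(...). Dropping the receiver
    makes the result insensitive to variable naming (a simplification)."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Attribute):
                names.add(node.func.attr)
            elif isinstance(node.func, ast.Name):
                names.add(node.func.id)
    return names


def api_overlap(reference_src, candidate_src):
    """Fallback signal: fraction of the reference solution's calls that
    also appear in the candidate. Returns 0.0 for an empty reference."""
    reference_calls = called_names(reference_src)
    if not reference_calls:
        return 0.0
    return len(reference_calls & called_names(candidate_src)) / len(reference_calls)


reference = "mol = Chem.MolFromSmiles('CCO')\nn = mol.GetNumAtoms()\n"
candidate = "m = Chem.MolFromSmiles('CCO')\nprint(m.GetNumAtoms())\n"
overlap = api_overlap(reference, candidate)
```

In this sketch the output comparison would run first on the executed result, and the AST overlap would be consulted only when outputs cannot be compared directly. How the actual benchmark orders and weights the two layers is not stated in the abstract.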

Problem

Research questions and friction points this paper is trying to address.

Molecular Vibe Coding

LLM evaluation

code generation

molecular tasks

benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

Molecular Vibe Coding

MolViBench

executable code generation

multi-layered evaluation

AST-based semantic analysis

🔎 Similar Papers

3D-MolT5: Towards Unified 3D Molecule-Text Modeling with 3D Molecular Tokenization

2024-06-09 · arXiv.org · Citations: 1

Can LLMs Generate Diverse Molecules? Towards Alignment with Structural Diversity

2024-10-04 · arXiv.org · Citations: 0