ChemVA: Advancing Large Language Models on Chemical Reaction Diagrams Understanding

📅 2026-05-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two bottlenecks that limit large language models (LLMs) in understanding chemical reaction diagrams: generic visual encoders struggle to parse molecular topologies, and linear representations such as SMILES fail to effectively activate the chemical knowledge within LLMs. To overcome these limitations, the authors propose ChemVA, a framework built on a two-stage strategy of visual anchoring followed by semantic alignment. ChemVA first grounds functional groups through hybrid-granularity detection, then translates the resulting visual features into chemical entity names, activating the domain-specific knowledge embedded in LLMs. The study also introduces OCRD-Bench, a new benchmark with dense visual-semantic context, on which ChemVA achieves 92.0% structural recognition accuracy. Across nine LLMs, the approach yields a consistent gain of approximately 20 percentage points, enabling open-weight models to rival state-of-the-art closed-source systems on complex chemical reasoning tasks.

📝 Abstract

While Large Language Models (LLMs) have revolutionized scientific text processing, they exhibit a significant capability gap when interpreting chemical reaction diagrams. We identify two fundamental bottlenecks restricting current systems: a Visual Deficit, where generic vision encoders struggle to resolve the strict topological connectivity of dense molecular graphs, and a Semantic Disconnect, where standard linear strings, such as SMILES, fail to effectively activate the model's latent chemical reasoning. To bridge these gaps, we propose the Chemical Visual Activation (ChemVA) framework, which employs a Visual Anchor mechanism to ground functional groups via hybrid-granularity detection, followed by a semantic alignment approach that translates visual features into entity names to maximize knowledge activation in LLMs. We evaluate our approach on OCRD-Bench, a newly constructed dataset featuring dense visual-semantic contexts and comprehensive reaction coverage to evaluate the full spectrum from recognition to reasoning. Extensive experiments on OCRD-Bench demonstrate that ChemVA achieves 92.0% structural recognition accuracy. By bridging visual and semantic bottlenecks, our framework delivers a consistent performance gain of approximately 20 percentage points across 9 diverse LLMs, enabling open-weight models to rival proprietary SOTA systems in complex chemical reasoning tasks.
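The semantic alignment idea in the abstract, handing the LLM entity names rather than SMILES strings, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Anchor` structure, the `build_prompt` function, the confidence threshold, and the example detections are all hypothetical, and the detector that would produce the anchors is assumed rather than shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Anchor:
    """One detected structure in a reaction diagram (hypothetical format)."""
    role: str          # "reactant", "reagent", or "product"
    name: str          # chemical entity name for the whole structure
    groups: List[str]  # names of functional groups detected inside it
    confidence: float  # detector score in [0, 1]


def describe(anchor: Anchor) -> str:
    """Render one anchor as a name-based phrase instead of a SMILES string."""
    if anchor.groups:
        return f"{anchor.name} (functional groups: {', '.join(anchor.groups)})"
    return anchor.name


def build_prompt(anchors: List[Anchor], question: str,
                 min_confidence: float = 0.5) -> str:
    """Assemble a text context from confident anchors, grouped by role."""
    kept = [a for a in anchors if a.confidence >= min_confidence]
    lines = []
    for role in ("reactant", "reagent", "product"):
        described = [describe(a) for a in kept if a.role == role]
        if described:
            lines.append(f"{role}: " + "; ".join(described))
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Made-up detections for a Fischer esterification diagram.
EXAMPLE_ANCHORS = [
    Anchor("reactant", "benzoic acid", ["carboxylic acid", "phenyl"], 0.97),
    Anchor("reactant", "methanol", ["hydroxyl"], 0.94),
    Anchor("reagent", "sulfuric acid", [], 0.88),
    Anchor("product", "methyl benzoate", ["ester", "phenyl"], 0.95),
    Anchor("product", "acetone", ["ketone"], 0.20),  # spurious, filtered out
]

if __name__ == "__main__":
    print(build_prompt(EXAMPLE_ANCHORS, "What type of reaction is shown?"))
```

In this sketch the low-confidence `acetone` detection is dropped before the prompt is built, so the LLM only sees named entities that the (assumed) detector is reasonably sure about.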

Problem

Research questions and friction points this paper is trying to address.

Chemical Reaction Diagrams

Large Language Models

Visual Deficit

Semantic Disconnect

Molecular Graphs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Chemical Visual Activation

Visual Anchor

Semantic Alignment