Fine-tuning DeepSeek-OCR-2 for Molecular Structure Recognition

📅 2026-04-03
🤖 AI Summary
Accurately converting two-dimensional molecular structure diagrams from printed documents into SMILES strings remains challenging, and directly fine-tuned vision-language models often perform poorly. This work proposes MolSeek-OCR, which formulates the problem as image-conditioned sequence generation. Built on the DeepSeek-OCR-2 architecture, the model is trained with a two-stage progressive fine-tuning strategy that mitigates training instability: parameter-efficient fine-tuning with LoRA first, followed by selective full-parameter fine-tuning with split learning rates. Training uses a hybrid dataset combining synthetic PubChem renderings and real patent images from USPTO-MOL to improve coverage and robustness. The resulting model reaches exact-match accuracy comparable to the best-performing image-to-sequence model, but remains below state-of-the-art image-to-graph models.
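The idea of assigning different learning rates to different parts of the model in the second stage can be illustrated with a small sketch. This is not the authors' configuration: the module names (`vision_encoder.`, `projector.`, `decoder.`), the learning-rate values, and the helper `build_param_groups` are all hypothetical, chosen only to show how parameters might be split into groups by name prefix.

```python
from typing import Dict, Iterable, List


def build_param_groups(
    param_names: Iterable[str],
    lr_by_prefix: Dict[str, float],
    default_lr: float,
) -> List[dict]:
    """Group parameter names by longest matching prefix and attach a learning rate.

    Hypothetical helper for illustration; not taken from the paper.
    """
    groups: Dict[float, List[str]] = {}
    for name in param_names:
        matches = [prefix for prefix in lr_by_prefix if name.startswith(prefix)]
        # The most specific (longest) prefix wins; unmatched names use default_lr.
        lr = lr_by_prefix[max(matches, key=len)] if matches else default_lr
        groups.setdefault(lr, []).append(name)
    return [{"lr": lr, "params": names} for lr, names in groups.items()]


# Assumed module names and learning rates, for illustration only.
names = [
    "vision_encoder.layer0.weight",
    "vision_encoder.layer1.weight",
    "projector.weight",
    "decoder.layer0.weight",
]
param_groups = build_param_groups(
    names,
    lr_by_prefix={"vision_encoder.": 1e-6, "decoder.": 1e-5},
    default_lr=5e-6,
)
```

In PyTorch, a list of dictionaries with the same `params` and `lr` keys (holding tensors rather than names) can be passed to an optimizer such as `torch.optim.AdamW`, which then applies a separate learning rate to each group.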
📝 Abstract
Optical Chemical Structure Recognition (OCSR) is critical for converting 2D molecular diagrams from printed literature into machine-readable formats. While Vision-Language Models have shown promise in end-to-end OCR tasks, their direct application to OCSR remains challenging, and direct full-parameter supervised fine-tuning often fails. In this work, we adapt DeepSeek-OCR-2 for molecular optical recognition by formulating the task as image-conditioned SMILES generation. To overcome training instabilities, we propose a two-stage progressive supervised fine-tuning strategy: starting with parameter-efficient LoRA and transitioning to selective full-parameter fine-tuning with split learning rates. We train our model on a large-scale corpus combining synthetic renderings from PubChem and realistic patent images from USPTO-MOL to improve coverage and robustness. Our fine-tuned model, MolSeek-OCR, demonstrates competitive capabilities, achieving exact-match accuracies comparable to the best-performing image-to-sequence model. However, it remains inferior to state-of-the-art image-to-graph models. Furthermore, we explore reinforcement-style post-training and data-curation-based refinement, finding that they fail to improve the strict sequence-level fidelity required for exact SMILES matching.
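To make the strictness of exact SMILES matching concrete, here is a minimal sketch of an exact-match accuracy metric. The abstract does not specify the evaluation protocol, so the function `exact_match_accuracy`, its `normalize` hook, and the sample strings are assumptions for illustration. By default the sketch only strips whitespace; in practice a canonicalization function from a cheminformatics toolkit such as RDKit would likely be supplied as `normalize`.

```python
from typing import Callable, Sequence


def exact_match_accuracy(
    predictions: Sequence[str],
    references: Sequence[str],
    normalize: Callable[[str], str] = str.strip,
) -> float:
    """Fraction of predictions whose normalized string equals the reference.

    Hypothetical metric sketch; the paper's exact protocol is not given here.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Illustrative inputs: ethanol, benzene (aromatic vs. Kekule form), acetic acid.
preds = ["CCO", "c1ccccc1", "CC(=O)O "]
refs = ["CCO", "C1=CC=CC=C1", "CC(=O)O"]
accuracy = exact_match_accuracy(preds, refs)
```

With plain string comparison the benzene pair counts as a miss even though both strings denote the same molecule, which is why a canonicalizing `normalize` matters. Even after canonicalization, a single wrong atom or bond anywhere in the sequence makes the whole prediction incorrect.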
Problem

Research questions and friction points this paper is trying to address.

Optical Chemical Structure Recognition
Molecular Structure Recognition
SMILES Generation
Vision-Language Models
Fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

progressive fine-tuning
LoRA
SMILES generation
molecular structure recognition
split learning rates
Haocheng Tang
(1) School of Pharmacy, University of Pittsburgh, Pittsburgh, PA, USA; (2) Computational Chemical Genomics Screening Center, University of Pittsburgh, Pittsburgh, PA, USA; (3) Khoury College of Computer Science, Northeastern University, Boston, MA, USA
Xingyu Dang
(4) Department of Computer Science, Princeton University, Princeton, NJ, USA
Junmei Wang
Professor of computational chemistry/biology, School of Pharmacy, University of Pittsburgh
Computational Chemistry, Force Field Development, Computational Biophysics, Drug Design, Pharmacometrics and Systems Pharmacology