Doc2SAR: A Synergistic Framework for High-Fidelity Extraction of Structure-Activity Relationships from Scientific Documents

📅 2025-06-24
🤖 AI Summary
Extracting molecular structure–activity relationships (SARs) from scientific literature and patents is hampered by heterogeneous document formats, weak layout understanding, and low chemical structure recognition accuracy. To address these challenges, the paper proposes what it describes as the first domain-specific collaborative framework for SAR extraction. The method integrates optical chemical structure recognition (OCSR), document layout analysis, supervised fine-tuning of a multimodal large language model (MLLM), and a cheminformatics toolchain. The paper also introduces DocSAR-200, presented as the first rigorously annotated benchmark for SAR extraction. In experiments, the approach achieves 80.78% Table Recall, exceeding the end-to-end GPT-4o baseline by 51.48%. It supports efficient inference and is accompanied by a web application. Key contributions include: (1) the first SAR-oriented multimodal collaborative architecture; (2) DocSAR-200, the first high-quality, expert-annotated SAR extraction benchmark; and (3) an open-source, deployable technical solution.

📝 Abstract
Extracting molecular structure-activity relationships (SARs) from scientific literature and patents is essential for drug discovery and materials research. However, this task remains challenging due to heterogeneous document formats and limitations of existing methods. Specifically, rule-based approaches relying on rigid templates fail to generalize across diverse document layouts, while general-purpose multimodal large language models (MLLMs) lack sufficient accuracy and reliability for specialized tasks, such as layout detection and optical chemical structure recognition (OCSR). To address these challenges, we introduce DocSAR-200, a rigorously annotated benchmark of 200 scientific documents designed specifically for evaluating SAR extraction methods. Additionally, we propose Doc2SAR, a novel synergistic framework that integrates domain-specific tools with MLLMs enhanced via supervised fine-tuning (SFT). Extensive experiments demonstrate that Doc2SAR achieves state-of-the-art performance across various document types, significantly outperforming leading end-to-end baselines. Specifically, Doc2SAR attains an overall Table Recall of 80.78% on DocSAR-200, exceeding end-to-end GPT-4o by 51.48%. Furthermore, Doc2SAR demonstrates practical usability through efficient inference and is accompanied by a web app.
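
The abstract reports the margin over GPT-4o as "51.48%" without saying whether this is a difference in percentage points or a relative improvement. The short sketch below works out the baseline score each reading would imply. The function names are illustrative only, and neither result is a figure reported in the abstract.

```python
# Figures quoted in the abstract.
DOC2SAR_TABLE_RECALL = 80.78  # percent, overall Table Recall on DocSAR-200
REPORTED_MARGIN = 51.48       # the "51.48%" margin over end-to-end GPT-4o


def baseline_if_absolute(score: float, margin: float) -> float:
    """Implied baseline if the margin is a difference in percentage points."""
    return round(score - margin, 2)


def baseline_if_relative(score: float, margin: float) -> float:
    """Implied baseline if the margin is a relative improvement in percent."""
    return round(score / (1 + margin / 100), 2)


if __name__ == "__main__":
    # Both values are derived for illustration, not taken from the paper.
    print(baseline_if_absolute(DOC2SAR_TABLE_RECALL, REPORTED_MARGIN))
    print(baseline_if_relative(DOC2SAR_TABLE_RECALL, REPORTED_MARGIN))
```

Under the percentage-point reading the baseline would sit near 29.30% Table Recall; under the relative reading it would sit near 53.33%. The abstract alone does not settle which is meant.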
Problem

Research questions and friction points this paper is trying to address.

Extracting SARs from diverse scientific documents is challenging
Existing methods lack accuracy for specialized tasks like OCSR
Heterogeneous document formats hinder reliable SAR extraction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates domain-specific tools with fine-tuned MLLMs
Introduces DocSAR-200 benchmark for SAR extraction
Achieves 80.78% overall Table Recall on DocSAR-200, exceeding end-to-end GPT-4o by 51.48%