🤖 AI Summary
To address the risk of leaking raw chemical reaction data in centralized retrosynthesis prediction, this work proposes CKIF, a privacy-preserving distributed framework for learning retrosynthesis models. The framework adopts a federated learning paradigm in which raw reaction data remain stored locally and are never shared across clients. It introduces a chemistry-driven model aggregation mechanism: each client model's contribution is quantified adaptively via interpretable reactant property metrics, enabling weighted aggregation grounded in chemical relevance. Furthermore, the framework integrates domain-informed chemical knowledge embedding with parameter-level differential privacy during local training. Evaluated on the USPTO-50K benchmark, the method improves on standard FedAvg by approximately 20%, a substantial gain in predictive performance achieved without sharing raw reaction data.
📝 Abstract
Chemical reaction data is a pivotal asset, driving advances in competitive fields such as pharmaceuticals, materials science, and industrial chemistry. Its proprietary nature renders it sensitive, as it often includes confidential insights and competitive advantages that organizations strive to protect. However, in contrast to this need for confidentiality, the current standard training paradigm for machine learning-based retrosynthesis gathers reaction data from multiple sources into a single central location to train prediction models. This paradigm poses considerable privacy risks, as it necessitates broad data availability across organizational boundaries and frequent data transmission between entities, potentially exposing proprietary information to unauthorized access or interception during storage and transfer. In the present study, we introduce the chemical knowledge-informed framework (CKIF), a privacy-preserving approach for learning retrosynthesis models. CKIF enables distributed training across multiple chemical organizations without compromising the confidentiality of proprietary reaction data. Instead of gathering raw reaction data, CKIF learns retrosynthesis models through iterative, chemical knowledge-informed aggregation of model parameters. In particular, the chemical properties of predicted reactants are leveraged to quantitatively assess the observable behaviors of individual models, and this assessment in turn determines the adaptive weights used for model aggregation. On a variety of reaction datasets, CKIF outperforms several strong baselines by a clear margin (e.g., ~20% performance improvement over FedAvg on USPTO-50K), demonstrating its feasibility and effectiveness and providing a basis for further research on privacy-preserving retrosynthesis.
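
The aggregation idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not say how the chemistry-based assessment is computed or how it is turned into weights, so the per-client `scores`, the softmax normalization, the `temperature` parameter, and every function name below are assumptions made for illustration. A size-weighted FedAvg rule is included only as a familiar point of comparison.

```python
import math
from typing import Dict, List

# A model is represented here as a mapping from parameter name to a flat
# list of floats. Real models would use tensors; this keeps the sketch
# dependency-free.
Params = Dict[str, List[float]]


def fedavg_weights(num_examples: List[int]) -> List[float]:
    """Baseline: weight each client by its share of the training examples."""
    total = sum(num_examples)
    if total <= 0:
        raise ValueError("at least one client must hold data")
    return [n / total for n in num_examples]


def score_weights(scores: List[float], temperature: float = 1.0) -> List[float]:
    """Hypothetical rule: softmax over per-client quality scores.

    Each score stands in for an assessment of a client's model based on
    the chemical properties of the reactants it predicts. The softmax
    mapping is an assumption, not taken from the paper.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(client_params: List[Params], weights: List[float]) -> Params:
    """Return the weighted average of the clients' parameters."""
    if len(client_params) != len(weights):
        raise ValueError("need exactly one weight per client")
    merged: Params = {}
    for name, reference in client_params[0].items():
        merged[name] = [
            sum(w * p[name][i] for w, p in zip(weights, client_params))
            for i in range(len(reference))
        ]
    return merged


if __name__ == "__main__":
    clients = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
    # Only parameters and scalar scores are exchanged, never reaction data.
    print(aggregate(clients, fedavg_weights([100, 300])))
    print(aggregate(clients, score_weights([0.2, 0.9])))
```

In this sketch the only difference between the two schemes is where the weights come from: FedAvg uses dataset sizes, while the score-based rule gives more influence to clients whose models receive higher scores.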