🤖 AI Summary
In AI-driven drug discovery, molecular generation and editing face two key challenges: (1) modeling complex structure–property relationships and (2) sparse, incomplete multi-property annotations. To address these, we propose HSPAG, a data-efficient framework that aligns structural representations with property labels at the atom, substructure, and whole-molecule levels. To reduce the pre-training data required, scaffold-based clustering selects representative samples and an auxiliary variational autoencoder (VAE) identifies hard samples. A property-relevance-aware masking mechanism and diversified perturbation strategies strengthen the cross-modal alignment between SMILES strings and multi-property labels under sparse annotations. The method reduces reliance on large annotated datasets and supports controllable generation under multiple property constraints. Two real-world drug discovery case studies validate its editing capabilities.
📝 Abstract
Property-constrained molecular generation and editing are crucial in AI-driven drug discovery but remain hindered by two factors: (i) the complex relationships between molecular structures and multiple properties are difficult to capture, and (ii) the narrow coverage and incomplete annotation of molecular properties weaken property-based models. To tackle these limitations, we propose HSPAG, a data-efficient framework featuring hierarchical structure-property alignment. By treating SMILES and molecular properties as complementary modalities, the model learns their relationships at the atom, substructure, and whole-molecule levels. Moreover, we select representative samples through scaffold clustering and hard samples via an auxiliary variational autoencoder (VAE), substantially reducing the required pre-training data. In addition, we incorporate a property-relevance-aware masking mechanism and diversified perturbation strategies to enhance generation quality under sparse annotations. Experiments demonstrate that HSPAG captures fine-grained structure-property relationships and supports controllable generation under multiple property constraints. Two real-world case studies further validate its editing capabilities.
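The data-selection step described above can be illustrated with a minimal Python sketch. It is not the HSPAG implementation: the abstract does not specify how representatives or hard samples are scored, so this sketch assumes that scaffold strings are precomputed (for example, Murcko scaffolds from a cheminformatics toolkit), that the auxiliary VAE supplies a per-molecule reconstruction loss, that the lowest-loss member of each scaffold cluster serves as its representative, and that the highest-loss remaining molecules count as hard. The function name `select_pretraining_subset`, the record keys, and all numeric values are hypothetical.

```python
from collections import defaultdict


def select_pretraining_subset(records, per_scaffold=1, n_hard=2):
    """Pick representative and hard samples from a list of molecule records.

    Each record is a dict with hypothetical keys:
      'smiles'   - the molecule's SMILES string
      'scaffold' - a precomputed scaffold string (assumed given)
      'vae_loss' - reconstruction loss from an auxiliary VAE (assumed given)
    """
    # Cluster molecules by shared scaffold.
    clusters = defaultdict(list)
    for rec in records:
        clusters[rec["scaffold"]].append(rec)

    # Assumption: the lowest-loss members stand in for each cluster.
    representative = []
    for members in clusters.values():
        ranked = sorted(members, key=lambda r: r["vae_loss"])
        representative.extend(ranked[:per_scaffold])

    # Assumption: among the rest, the highest-loss molecules are "hard".
    chosen = {r["smiles"] for r in representative}
    remaining = [r for r in records if r["smiles"] not in chosen]
    hard = sorted(remaining, key=lambda r: r["vae_loss"], reverse=True)[:n_hard]
    return representative, hard


# Toy data; loss values are made up for illustration only.
records = [
    {"smiles": "c1ccccc1O", "scaffold": "c1ccccc1", "vae_loss": 0.2},
    {"smiles": "c1ccccc1N", "scaffold": "c1ccccc1", "vae_loss": 0.9},
    {"smiles": "Cc1ccccc1", "scaffold": "c1ccccc1", "vae_loss": 0.4},
    {"smiles": "c1ccncc1C", "scaffold": "c1ccncc1", "vae_loss": 0.5},
    {"smiles": "c1ccncc1O", "scaffold": "c1ccncc1", "vae_loss": 1.3},
]

representative, hard = select_pretraining_subset(records)
print([r["smiles"] for r in representative])
print([r["smiles"] for r in hard])
```

Grouping by scaffold keeps structural diversity in the reduced set, while the loss-based pick adds molecules the auxiliary model reconstructs poorly. Both selection criteria here are placeholders for whatever scoring the full paper defines.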