🤖 AI Summary
Existing molecular representation methods struggle to capture higher-order chemical environments, such as stereochemistry and conjugation effects, and cannot cleanly delineate inherently ambiguous substructure boundaries, which limits how well they model drug–target binding interactions. To address these challenges, this work proposes OverlapBPE, a data-driven tokenization strategy that segments molecules into subword-like fragments that are allowed to overlap, and h-MINT, a hierarchical neural architecture that jointly models interactions at both the atom and fragment levels. The approach improves Pearson/Spearman correlation for binding affinity prediction by 2–4% on PDBBind and LBA, improves key virtual screening metrics by 1–3% on DUD-E and LIT-PCBA, and achieves the best overall high-throughput screening (HTS) performance on PubChem assays.
📝 Abstract
Accurate molecular representations are critical for drug discovery. A central challenge lies in capturing the chemical environment of molecular fragments, because key interactions, such as hydrogen bonding and π-stacking, occur only under specific local conditions. Most existing approaches represent molecules as atom-level graphs; however, atom-level representations struggle to express higher-order chemical context (e.g., stereochemistry, lone pairs, conjugation). Fragment-based methods (e.g., principal subgraphs, predefined functional groups) fail to preserve essential information such as chirality, aromaticity, and ionic states. This work addresses these limitations from two aspects. (i) OverlapBPE tokenization. We propose a novel data-driven molecule tokenization method. Unlike existing approaches, our method allows fragments to overlap, reflecting the inherently fuzzy boundaries of small-molecule substructures. Combined with enriched chemical information at the token level, this preserves a more complete chemical context. (ii) h-MINT model. OverlapBPE induces many-to-many atom-fragment mappings, which call for a new hierarchical architecture. We therefore develop a hierarchical molecular interaction network that jointly models interactions at both the atom and fragment levels and natively supports overlapping fragments. Extensive evaluation against state-of-the-art methods shows that our method improves binding affinity prediction by 2–4% in Pearson/Spearman correlation on PDBBind and LBA, improves key virtual screening metrics by 1–3% on DUD-E and LIT-PCBA, and achieves the best overall HTS performance on PubChem assays. Further analysis demonstrates that our method effectively captures interaction information while maintaining good generalization.
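To make the many-to-many atom-fragment mapping concrete, the following is a minimal sketch, not the authors' implementation. The atom indices, fragment names, feature values, and the mean-pooling step are all illustrative assumptions; the abstract does not specify how OverlapBPE selects fragments or how h-MINT aggregates atom information. The sketch only shows the bookkeeping that overlapping fragments require: one atom may belong to several fragments, so both lookup directions are needed.

```python
from collections import defaultdict

# Hypothetical toy input: atoms indexed 0..5 and three hand-picked fragments.
# frag_A and frag_B share atom 2; frag_B and frag_C share atom 4.
# A real tokenizer would learn fragments from data; these are made up.
fragments = {
    "frag_A": [0, 1, 2],
    "frag_B": [2, 3, 4],
    "frag_C": [4, 5],
}


def build_mappings(fragments):
    """Return (fragment -> atoms, atom -> fragments) lookup tables."""
    frag_to_atoms = {name: sorted(atoms) for name, atoms in fragments.items()}
    atom_to_frags = defaultdict(list)
    for name, atoms in frag_to_atoms.items():
        for atom in atoms:
            atom_to_frags[atom].append(name)
    return frag_to_atoms, dict(atom_to_frags)


def pool_atoms_to_fragments(atom_features, frag_to_atoms):
    """Mean-pool per-atom feature vectors into one vector per fragment.

    Mean pooling is an assumption chosen for simplicity, not the
    aggregation used by h-MINT.
    """
    pooled = {}
    for name, atoms in frag_to_atoms.items():
        dim = len(atom_features[atoms[0]])
        pooled[name] = [
            sum(atom_features[a][d] for a in atoms) / len(atoms)
            for d in range(dim)
        ]
    return pooled


frag_to_atoms, atom_to_frags = build_mappings(fragments)

# Made-up two-dimensional atom features: [atom index as float, constant 1.0].
atom_features = {i: [float(i), 1.0] for i in range(6)}
fragment_features = pool_atoms_to_fragments(atom_features, frag_to_atoms)

# Atoms that belong to more than one fragment are exactly the overlaps.
shared_atoms = sorted(a for a, frags in atom_to_frags.items() if len(frags) > 1)
```

In a non-overlapping scheme every atom would map to exactly one fragment, so `atom_to_frags` would reduce to a plain atom-to-fragment assignment. With overlaps, shared atoms such as 2 and 4 contribute to several fragment vectors at once, which is why a model consuming this tokenization has to handle a many-to-many relation.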