ProtoMol: Enhancing Molecular Property Prediction via Prototype-Guided Multimodal Learning

📅 2025-10-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal molecular representation learning methods suffer from two key limitations: (1) cross-modal interaction occurs only at the final encoder layer, neglecting hierarchical semantic dependencies; and (2) the absence of a unified prototype space leads to unstable modality alignment. To address these limitations, ProtoMol introduces a hierarchical bidirectional cross-modal attention mechanism that enables fine-grained semantic alignment across multiple layers of a dual-branch encoder, which pairs a graph neural network for molecular graphs with a Transformer for textual descriptions. It further constructs a learnable shared prototype space that explicitly guides modality-consistent representation learning. This prototype-guided hierarchical fusion strategy enhances both model discriminability and interpretability. Extensive experiments on multiple molecular property prediction benchmarks, covering toxicity, bioactivity, and physicochemical properties, demonstrate consistent gains over state-of-the-art methods, validating the effectiveness and generalizability of the approach.

📝 Abstract
Multimodal molecular representation learning, which jointly models molecular graphs and their textual descriptions, integrates structural and semantic information to enable more robust and reliable predictions of drug toxicity, bioactivity, and physicochemical properties, improving both predictive accuracy and interpretability. However, existing multimodal methods suffer from two key limitations: (1) they typically perform cross-modal interaction only at the final encoder layer, overlooking hierarchical semantic dependencies; and (2) they lack a unified prototype space for robust alignment between modalities. To address these limitations, we propose ProtoMol, a prototype-guided multimodal framework that enables fine-grained integration and consistent semantic alignment between molecular graphs and textual descriptions. ProtoMol employs dual-branch hierarchical encoders, using Graph Neural Networks to process structured molecular graphs and Transformers to encode unstructured texts, yielding comprehensive layer-wise representations. ProtoMol then introduces a layer-wise bidirectional cross-modal attention mechanism that progressively aligns semantic features across layers. Furthermore, a shared prototype space with learnable, class-specific anchors is constructed to guide both modalities toward coherent and discriminative representations. Extensive experiments on multiple benchmark datasets demonstrate that ProtoMol consistently outperforms state-of-the-art baselines across a variety of molecular property prediction tasks.
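The layer-wise bidirectional cross-modal attention described in the abstract can be sketched as follows: at every encoder layer, graph features attend to text features and vice versa, and the attended features are fused back into each branch. This is a minimal NumPy illustration under assumed shapes and a simple residual fusion; the function names, dimensions, and fusion rule are not from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Scaled dot-product attention: tokens of one modality (queries)
    attend to tokens of the other modality (keys and values)."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

def bidirectional_layer_fusion(graph_layers, text_layers):
    """At each encoder layer, fuse both modalities bidirectionally
    (here via a residual add of the cross-attended features)."""
    fused_g, fused_t = [], []
    for g, t in zip(graph_layers, text_layers):
        fused_g.append(g + cross_attention(g, t))  # graph attends to text
        fused_t.append(t + cross_attention(t, g))  # text attends to graph
    return fused_g, fused_t

rng = np.random.default_rng(0)
# Hypothetical setup: 3 encoder layers, 5 graph nodes, 7 text tokens, hidden size 16.
graph_layers = [rng.normal(size=(5, 16)) for _ in range(3)]
text_layers = [rng.normal(size=(7, 16)) for _ in range(3)]
fused_graph, fused_text = bidirectional_layer_fusion(graph_layers, text_layers)
```

Because the attention runs at every layer rather than only on the final representations, shallow features (e.g. local substructures and individual words) are aligned as well as deep, abstract ones.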
Problem

Research questions and friction points this paper is trying to address.

Hierarchical semantic dependencies are overlooked in multimodal molecular learning
Lack of a unified prototype space for robust cross-modal alignment
Need for fine-grained integration between molecular graphs and textual descriptions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-branch hierarchical encoders process molecular graphs (GNN) and texts (Transformer)
Layer-wise bidirectional cross-modal attention aligns features across encoder depths
Shared prototype space with learnable class anchors guides coherent, discriminative representations
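The shared prototype space can be illustrated with a small sketch: pooled embeddings from both modalities are scored against learnable class anchors by cosine similarity, and a cross-entropy term pulls both toward the same prototype. The loss form, temperature, and all names here are assumptions for illustration, not the paper's exact objective.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def prototype_logits(embedding, prototypes, temperature=0.1):
    """Cosine similarity between a pooled embedding and each
    class prototype, sharpened by a temperature."""
    sims = l2_normalize(embedding) @ l2_normalize(prototypes, axis=-1).T
    return sims / temperature

def prototype_alignment_loss(graph_emb, text_emb, prototypes, label):
    """Average cross-entropy of both modalities against the shared
    prototypes, pulling them toward the same class anchor."""
    loss = 0.0
    for emb in (graph_emb, text_emb):
        logits = prototype_logits(emb, prototypes)
        log_probs = logits - np.log(np.exp(logits).sum())
        loss += -log_probs[label]
    return loss / 2.0

rng = np.random.default_rng(1)
prototypes = rng.normal(size=(2, 16))  # 2 classes, hidden size 16 (assumed)
# Toy embeddings that already sit near the class-1 prototype.
graph_emb = prototypes[1] + 0.1 * rng.normal(size=16)
text_emb = prototypes[1] + 0.1 * rng.normal(size=16)
loss = prototype_alignment_loss(graph_emb, text_emb, prototypes, label=1)
```

Because both modalities are scored against the same anchors, the prototypes act as a common reference frame: minimizing this loss simultaneously makes each modality discriminative and keeps the two modalities consistent with each other.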