🤖 AI Summary
Multimodal molecular modeling faces two key challenges, unreliable 3D conformers and modality collapse, both of which undermine model robustness and generalization. To address these, we propose MuMo, a structured multimodal fusion framework built around a Structured Fusion Pipeline (SFP) and a Progressive Injection (PI) mechanism. The pipeline combines 2D topology and 3D geometry into a unified, stable structural prior. The injection mechanism then integrates this prior asymmetrically into the sequence stream, so that each modality keeps its own modeling path while still benefiting from cross-modal enrichment. MuMo is built on a state space backbone, which supports long-range dependency modeling. Evaluated on 29 benchmark tasks from Therapeutics Data Commons (TDC) and MoleculeNet, MuMo achieves an average improvement of 2.7% over the best-performing baseline on each task and ranks first on 22 of them. Notably, it delivers a 27% improvement on the LD50 task, which the authors take as evidence of robustness to 3D conformer noise and of the effectiveness of the multimodal fusion design.
📝 Abstract
Multimodal molecular models often suffer from 3D conformer unreliability and modality collapse, limiting their robustness and generalization. We propose MuMo, a structured multimodal fusion framework that addresses these challenges in molecular representation through two key strategies. To reduce the instability of conformer-dependent fusion, we design a Structured Fusion Pipeline (SFP) that combines 2D topology and 3D geometry into a unified and stable structural prior. To mitigate modality collapse caused by naive fusion, we introduce a Progressive Injection (PI) mechanism that asymmetrically integrates this prior into the sequence stream, preserving modality-specific modeling while enabling cross-modal enrichment. Built on a state space backbone, MuMo supports long-range dependency modeling and robust information propagation. Across 29 benchmark tasks from Therapeutics Data Commons (TDC) and MoleculeNet, MuMo achieves an average improvement of 2.7% over the best-performing baseline on each task, ranking first on 22 of them, including a 27% improvement on the LD50 task. These results validate its robustness to 3D conformer noise and the effectiveness of multimodal fusion in molecular representation. The code is available at: github.com/selmiss/MuMo.
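The abstract describes the asymmetric integration only at a high level, so the following is a minimal illustrative sketch of one possible reading, not the authors' implementation. It assumes the injection can be modeled as gated, one-way cross-attention in which sequence tokens query a fixed structural prior. The function names (`inject`, `progressive_injection`), the projection matrices, and the per-layer `gate` values are all hypothetical. The official code is at github.com/selmiss/MuMo.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def inject(seq, prior, w_q, w_k, w_v, gate):
    """Hypothetical one-way injection step.

    Sequence tokens attend to the structural prior; only the sequence
    stream is updated, which is what makes the fusion asymmetric.
    """
    q = seq @ w_q                                   # (n_tokens, d)
    k = prior @ w_k                                 # (n_nodes, d)
    v = prior @ w_v                                 # (n_nodes, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_tokens, n_nodes)
    return seq + gate * (attn @ v)                  # residual update of seq


def progressive_injection(seq, prior, layers):
    """Apply the injection step layer by layer (assumed meaning of 'progressive').

    The prior is read at every layer but never written to.
    """
    for layer in layers:
        seq = inject(seq, prior, **layer)
    return seq
```

In this sketch, a gate of zero leaves the sequence stream untouched, and the prior is never modified. That is one simple way to keep modality-specific modeling intact while still allowing cross-modal enrichment.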