MolVision: Molecular Property Prediction with Vision Language Models

📅 2025-07-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Conventional text-based molecular representations (e.g., SMILES, SELFIES) suffer from limited generalization in molecular property prediction. Method: This paper introduces a vision-language model (VLM)-based multimodal molecular modeling framework that incorporates molecular structure images as the visual modality. The authors construct a benchmark comprising ten diverse property prediction tasks and design a joint image-text encoding and alignment mechanism, enhanced by efficient LoRA-based fine-tuning to support zero-shot, few-shot, and full fine-tuning paradigms. Contribution/Results: Experimental results demonstrate that integrating visual information significantly improves cross-task generalization. The best performance is achieved when the image encoder is adapted and LoRA fine-tuning is employed, outperforming text-only baselines across all settings. The code and benchmark dataset are publicly released to foster reproducible research.

📝 Abstract
Molecular property prediction is a fundamental task in computational chemistry with critical applications in drug discovery and materials science. While recent works have explored Large Language Models (LLMs) for this task, they primarily rely on textual molecular representations such as SMILES/SELFIES, which can be ambiguous and structurally less informative. In this work, we introduce MolVision, a novel approach that leverages Vision-Language Models (VLMs) by integrating both molecular structure as images and textual descriptions to enhance property prediction. We construct a benchmark spanning ten diverse datasets, covering classification, regression, and description tasks. Evaluating nine different VLMs in zero-shot, few-shot, and fine-tuned settings, we find that visual information improves prediction performance, particularly when combined with efficient fine-tuning strategies such as LoRA. Our results reveal that while visual information alone is insufficient, multimodal fusion significantly enhances generalization across molecular properties. Adapting the vision encoder for molecular images in conjunction with LoRA further improves performance. The code and data are available at: https://molvision.github.io/MolVision/
Problem

Research questions and friction points this paper is trying to address.

Improving molecular property prediction using multimodal vision-language models
Addressing ambiguity in textual molecular representations like SMILES/SELFIES
Enhancing generalization across diverse molecular property tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Vision-Language Models for molecular prediction
Combines molecular images with textual descriptions
Employs LoRA for efficient fine-tuning
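The LoRA idea behind the last point can be illustrated with a minimal, self-contained sketch (not the paper's actual implementation, which fine-tunes full VLMs): a frozen pretrained weight matrix is augmented with a trainable low-rank update, so only a small fraction of parameters is updated during fine-tuning. The layer dimensions, rank, and scaling below are illustrative assumptions.

```python
import numpy as np

class LoRALinear:
    """Minimal LoRA-augmented linear layer: y = x @ (W + (alpha/r) * A @ B).

    The pretrained weight W is frozen; only the low-rank factors
    A (d_in x r) and B (r x d_out) are trained, reducing trainable
    parameters from d_in*d_out to r*(d_in + d_out).
    """

    def __init__(self, W, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                      # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, (W.shape[0], r)) # trainable down-projection
        self.B = np.zeros((r, W.shape[1]))              # trainable up-projection, zero-init
        self.scale = alpha / r

    def __call__(self, x):
        # Base path plus scaled low-rank correction.
        return x @ self.W + self.scale * (x @ self.A) @ self.B

    def trainable_params(self):
        return self.A.size + self.B.size

# Example: a 768x768 projection (typical transformer hidden size).
W = np.random.default_rng(1).normal(size=(768, 768))
layer = LoRALinear(W, r=8)
x = np.ones((1, 768))

# With B zero-initialised, the adapter starts as a no-op delta:
assert np.allclose(layer(x), x @ W)
print(layer.trainable_params(), W.size)  # 12288 vs 589824 frozen
```

Because B starts at zero, fine-tuning begins exactly at the pretrained model's behavior, which is part of why LoRA adapts large encoders stably at low cost.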