UniMoT: Unified Molecule-Text Language Model with Discrete Token Representation

📅 2024-08-01
🏛️ arXiv.org
📈 Citations: 11
Influential: 1
🤖 AI Summary
Existing molecular large language models (LLMs) predominantly adopt adapter-based architectures, resulting in modality asymmetry and insufficient supervision on the molecular side. Method: The paper proposes a unified molecule-text LLM that treats molecules as a "foreign language," introducing a vector quantization (VQ) tokenizer with a Q-Former that maps molecules into sequences of discrete, learnable tokens. This yields genuine modality equivalence between molecules and text: a shared vocabulary, causal masking, and autoregressive modeling. Contribution/Results: The model follows a four-stage progressive pretraining strategy and achieves state-of-the-art performance across diverse molecule comprehension and generation tasks. It supports bidirectional cross-modal generation (molecule ↔ text), generalizes well to multiple downstream tasks, and is presented as the first framework to realize truly symmetric, vocabulary-shared, autoregressive multimodal modeling of molecules and natural language.

📝 Abstract
The remarkable success of Large Language Models (LLMs) across diverse tasks has driven the research community to extend their capabilities to molecular applications. However, most molecular LLMs employ adapter-based architectures that do not treat molecule and text modalities equally and lack a supervision signal for the molecule modality. To address these issues, we introduce UniMoT, a Unified Molecule-Text LLM adopting a tokenizer-based architecture that expands the vocabulary of LLM with molecule tokens. Specifically, we introduce a Vector Quantization-driven tokenizer that incorporates a Q-Former to bridge the modality gap between molecule and text. This tokenizer transforms molecules into sequences of molecule tokens with causal dependency, encapsulating high-level molecular and textual information. Equipped with this tokenizer, UniMoT can unify molecule and text modalities under a shared token representation and an autoregressive training paradigm, enabling it to interpret molecules as a foreign language and generate them as text. Following a four-stage training scheme, UniMoT emerges as a multi-modal generalist capable of performing both molecule-to-text and text-to-molecule tasks. Extensive experiments demonstrate that UniMoT achieves state-of-the-art performance across a wide range of molecule comprehension and generation tasks.
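The abstract's core mechanism is vector quantization: continuous molecule embeddings (produced by a Q-Former) are snapped to their nearest entry in a learned codebook, and the entry's index becomes a discrete token. The sketch below illustrates only that nearest-code lookup; the codebook values, dimensions, and function names are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of vector-quantized molecule tokenization.
# The codebook here is random; in UniMoT it would be learned jointly
# with the Q-Former. Sizes (256 codes, 64 dims) are assumptions.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 64))  # 256 molecule tokens, 64-dim codes

def quantize(embeddings: np.ndarray) -> np.ndarray:
    """Map each continuous embedding to the index of its nearest code."""
    # Squared Euclidean distance from every embedding to every code.
    dists = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

# Stand-in for Q-Former output: 8 query embeddings for one molecule.
mol_embeddings = rng.normal(size=(8, 64))
token_ids = quantize(mol_embeddings)
print(token_ids.shape)  # (8,) -- one discrete token per query embedding
```

Because each molecule becomes a short sequence of integer IDs, it can be fed to the LLM exactly like a run of text tokens.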
Problem

Research questions and friction points this paper is trying to address.

UniMoT addresses the unequal treatment of molecule and text modalities in molecular LLMs.
It introduces a tokenizer that unifies molecule and text under a shared token representation.
The model casts both molecule comprehension and molecule generation as text tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vector Quantization-driven tokenizer for molecules
Unified token representation for molecule and text
Autoregressive training for multi-modal tasks
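The unified representation amounts to extending the LLM's text vocabulary with the molecule codebook, so one autoregressive model predicts both modalities left-to-right. A minimal sketch of the ID-space arithmetic, assuming illustrative vocabulary sizes (the actual sizes and any special delimiter tokens are not specified here):

```python
# Hedged sketch: appending molecule tokens to a text vocabulary so a
# single autoregressive model covers both modalities. The sizes below
# are assumptions for illustration, not UniMoT's actual configuration.
TEXT_VOCAB_SIZE = 32000   # assumed base text vocabulary
NUM_MOL_TOKENS = 256      # assumed molecule codebook size

def mol_token_id(code_index: int) -> int:
    """Offset a codebook index into the extended shared vocabulary."""
    assert 0 <= code_index < NUM_MOL_TOKENS
    return TEXT_VOCAB_SIZE + code_index

# A mixed training sequence: text IDs interleaved with molecule IDs,
# all drawn from one shared vocabulary and modeled left-to-right.
sequence = [101, 2023, mol_token_id(17), mol_token_id(203), 102]
print(sequence)  # [101, 2023, 32017, 32203, 102]
```

With this layout, "generating a molecule" is just sampling token IDs above `TEXT_VOCAB_SIZE`, which is what lets the same causal-masked objective supervise both modalities.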