AI Summary
This work addresses the challenge that current large language models (LLMs) struggle to effectively comprehend molecular graph structures, and existing graph-LLM alignment approaches rely on static tokens, neglect stereochemistry and substructural context, and require costly LLM fine-tuning. To overcome these limitations, the authors propose EDT-Former, an entropy-guided dynamic token Transformer that generates informative molecular fragment-based tokens on-the-fly, enabling efficient alignment between a graph encoder and a frozen LLM backbone. EDT-Former introduces, for the first time, an entropy-guided mechanism that jointly captures both local and global structural characteristics of molecules, significantly enhancing the efficiency and generalization of multimodal molecular understanding. The method achieves state-of-the-art performance across multiple benchmarks, including MoleculeQA, Mol-Instructions, TDC, and MoleculeNet.
Abstract
Molecular understanding is central to advancing areas such as scientific discovery, yet Large Language Models (LLMs) struggle to understand molecular graphs effectively. Existing graph-LLM bridges often adapt the Q-Former-style connector with fixed-length static tokens, a design originally developed for vision tasks. These designs overlook stereochemistry and substructural context and typically require costly LLM-backbone fine-tuning, limiting efficiency and generalization. We introduce EDT-Former, an Entropy-guided Dynamic Token Transformer that generates tokens aligned with informative molecular patches, thereby preserving both local and global structural features for molecular graph understanding. Beyond prior approaches, EDT-Former enables alignment between frozen graph encoders and LLMs without tuning the LLM backbone (excluding the embedding layer), resulting in computationally efficient fine-tuning, and achieves state-of-the-art results on MoleculeQA, molecule-oriented Mol-Instructions, and property prediction benchmarks (TDC, MoleculeNet), underscoring its effectiveness for scalable and generalizable multimodal molecular understanding.
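The core idea the abstract describes, using entropy to decide which molecular patches become tokens so the token count varies per molecule, can be sketched as follows. This is a hedged illustration only, not the authors' implementation: the per-patch attention scores as input, the keep-low-entropy rule (a peaked attention distribution is treated as a distinctive, informative fragment), and the threshold `tau` are all assumptions made for the sketch.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of raw scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(p):
    # Shannon entropy (nats) of a probability distribution.
    return -sum(q * math.log(q) for q in p if q > 0)

def select_dynamic_tokens(patch_scores, tau=0.5):
    """Entropy-guided dynamic token selection (illustrative sketch).

    patch_scores: one list of raw attention scores per molecular patch.
    A patch is kept as a token when its normalized attention entropy is
    below tau * log(n), i.e. its attention is concentrated (assumed to
    signal an informative fragment). The result is a variable-length
    index list, unlike a fixed-length Q-Former query set.
    """
    kept = []
    for i, scores in enumerate(patch_scores):
        max_h = math.log(len(scores))  # entropy of a uniform distribution
        h = entropy(softmax(scores))
        if h < tau * max_h:
            kept.append(i)
    return kept
```

For example, a patch with one dominant score is kept while a patch with uniform scores is dropped, so different molecules naturally yield different numbers of tokens:

```python
select_dynamic_tokens([[5.0, 0.1, 0.1], [1.0, 1.0, 1.0]])  # keeps only patch 0
```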