Alignment or Integration? Rethinking Multimodal Fusion in DNA-language Foundation Models

Date: 2026-01-21
Venue: arXiv.org
Citations: 0
Influential citations: 0
AI Summary
This work addresses the challenge of multimodal fusion in large language models, where geometric disparities between modalities hinder effective integration and modular architectures expend substantial computational resources on alignment rather than deep reasoning. The study formally characterizes the "modality gap" problem for the first time and introduces a natively unified architecture that eliminates this gap by employing a shared cross-modal tokenizer to map all inputs directly into a common token space. This design enables zero modality gap across all hidden layers without requiring dedicated modality-specific encoders. The proposed model supports end-to-end training and demonstrates significant performance gains over conventional modular approaches on DNA–text multimodal tasks, exhibiting superior capabilities in deep biological reasoning.
Abstract
Fusing DNA foundation models with large language models (LLMs) for DNA-language reasoning raises a fundamental question: at what level should genomic sequences and natural language interact? Most existing approaches encode DNA sequences and text separately and rely on embedding-level alignment to connect the two modalities. Such late-stage fusion compresses rich genomic sequences into fixed representations, limiting the model's ability to reason over fine-grained, token-level genomic structure. In this work, we propose two new methods for DNA-language fusion: SeqCLIP, a semantic alignment method, and OneVocab, a vocabulary-level integration method. SeqCLIP strengthens embedding-level alignment via sequence-level contrastive pre-training, whereas OneVocab directly integrates genomic $k$-mers into the language model's existing vocabulary. Comprehensive experiments on classification and reasoning tasks show that, while various alignment strategies improve embedding-level fusion, early vocabulary-level integration yields more expressive and effective representations for DNA-language modeling.
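To make the idea of vocabulary-level integration concrete, the following is a minimal sketch, not the paper's implementation. It assumes non-overlapping $k$-mers over the alphabet A, C, G, T, a toy word-level text vocabulary with contiguous ids, and a hypothetical `<dna_...>` token naming scheme; none of these details come from the abstract, and all function names are illustrative.

```python
from itertools import product


def build_kmer_vocab(k):
    """Enumerate all 4**k DNA k-mers over the alphabet A, C, G, T."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def extend_vocab(text_vocab, kmers):
    """Append k-mer tokens to an existing token->id map without renumbering it.

    Assumes the existing ids are contiguous from 0, so len(vocab) is the
    next free id.
    """
    vocab = dict(text_vocab)
    for kmer in kmers:
        token = f"<dna_{kmer}>"  # hypothetical naming scheme, avoids clashes with text tokens
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def tokenize_dna(seq, k):
    """Split a sequence into non-overlapping k-mers.

    A trailing remainder shorter than k is dropped in this sketch.
    """
    seq = seq.upper()
    return [f"<dna_{seq[i:i + k]}>" for i in range(0, len(seq) - k + 1, k)]


def encode(words, dna, vocab, k):
    """Encode text tokens and a DNA sequence as ids from one shared vocabulary.

    Raises KeyError for tokens outside the vocabulary (for example k-mers
    containing N); a real tokenizer would need an unknown-token fallback.
    """
    tokens = list(words) + tokenize_dna(dna, k)
    return [vocab[t] for t in tokens]


# Toy text vocabulary, for illustration only.
text_vocab = {"is": 0, "this": 1, "a": 2, "promoter": 3, "?": 4}
vocab = extend_vocab(text_vocab, build_kmer_vocab(3))
ids = encode(["is", "this", "a", "promoter", "?"], "ACGTTGCA", vocab, 3)
print(len(vocab), ids)
```

In this sketch the text and DNA tokens share one id space, so a single model can attend over both at the token level. In an actual language model, the embedding table would also have to grow to cover the new ids, and how those new embeddings are initialized and trained is not specified here.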
Problem

Research questions and friction points this paper is trying to address.

modality gap
multimodal integration
heterogeneous inputs
cross-modal reasoning
Multimodal Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-gap integration
unified tokenizer
native multimodal architecture
geometric modality gap
shared token space
Yanan Li
Zhejiang Lab
Computer Vision, Zero/Few-Shot Learning, Long-Tailed Learning
Christina Yi Jin
Research Center for Frontier Fundamental Studies, Zhejiang Lab
Yuan Jin
Apple
Quantum Cascade Lasers, Semiconductor Physics, Integrated Photonics
Manli Luo
Research Center for Frontier Fundamental Studies, Zhejiang Lab
Tie Xu
Research Center for Frontier Fundamental Studies, Zhejiang Lab
Shuai Jiao
Research Center for Scientific Data Hub, Zhejiang Lab
Wei He
Research Center for Frontier Fundamental Studies, Zhejiang Lab
Qing Zhang
Zhejiang Lab
Data Science, Internet of Things, Smart Home, Smart Ageing