Alignment or Integration? Rethinking Multimodal Fusion in DNA-language Foundation Models

📅 2026-01-21

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work asks at what level genomic sequences and natural language should interact when DNA foundation models are fused with large language models (LLMs). Most existing approaches encode DNA and text separately and connect them through embedding-level alignment, a late-stage fusion that compresses genomic sequences into fixed representations and limits reasoning over token-level genomic structure. The paper proposes two methods: SeqCLIP, which strengthens embedding-level alignment through sequence-level contrastive pre-training, and OneVocab, which integrates genomic k-mers directly into the language model's existing vocabulary. Experiments on classification and reasoning tasks indicate that alignment strategies improve embedding-level fusion, but that early vocabulary-level integration yields more expressive and effective representations for DNA-language modeling.

📝 Abstract

Fusing DNA foundation models with large language models (LLMs) for DNA-language reasoning raises a fundamental question: at what level should genomic sequences and natural language interact? Most existing approaches encode DNA sequences and text separately and rely on embedding-level alignment to connect the two modalities. Such late-stage fusion compresses rich genomic sequences into fixed representations, limiting the model's ability to reason over fine-grained, token-level genomic structure. In this work, we propose two new methods for DNA-language fusion: SeqCLIP, a semantic alignment method, and OneVocab, a vocabulary-level integration method. SeqCLIP strengthens embedding-level alignment via sequence-level contrastive pre-training, and OneVocab directly integrates genomic $k$-mers into the language model's existing vocabulary. Comprehensive experiments on classification and reasoning tasks show that, while various alignment strategies improve embedding-level fusion, early vocabulary-level integration yields more expressive and effective representations for DNA-language modeling.
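
To make the two fusion levels in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names, the `<dna:...>` token format, the non-overlapping k-mer split, the `<unk>` fallback, and the temperature value are all assumptions chosen for illustration. The first part shows vocabulary-level integration in the spirit of OneVocab (k-mers appended to an existing text vocabulary so DNA and words share one id space); the second shows a CLIP-style symmetric contrastive loss of the kind that sequence-level contrastive pre-training typically uses.

```python
import math
from itertools import product

# --- Vocabulary-level integration (illustrative, hypothetical names) ---

def kmer_tokens(seq, k=3):
    """Split a DNA string into non-overlapping k-mers; an incomplete tail is dropped."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

def extend_vocab(text_vocab, k=3):
    """Copy text_vocab and append all 4**k k-mers as new tokens.

    Assumes existing ids are contiguous from 0, so len(vocab) is the next free id.
    """
    vocab = dict(text_vocab)
    for letters in product("ACGT", repeat=k):
        token = "<dna:" + "".join(letters) + ">"
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

def encode(text, dna, vocab, k=3):
    """Encode words and k-mers into one id sequence over the shared vocabulary."""
    unk = vocab["<unk>"]
    ids = [vocab.get(word, unk) for word in text.lower().split()]
    ids += [vocab.get("<dna:" + kmer + ">", unk) for kmer in kmer_tokens(dna, k)]
    return ids

# --- Embedding-level alignment (CLIP-style symmetric contrastive loss) ---

def _normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def _diagonal_cross_entropy(rows):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(rows):
        top = max(row)
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in row))
        total += log_sum_exp - row[i]
    return total / len(rows)

def symmetric_contrastive_loss(dna_emb, text_emb, temperature=0.07):
    """Average of DNA-to-text and text-to-DNA losses over a batch of matched pairs."""
    dna = [_normalize(v) for v in dna_emb]
    txt = [_normalize(v) for v in text_emb]
    logits = [[sum(a * b for a, b in zip(d, t)) / temperature for t in txt] for d in dna]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(transposed))
```

In the first part the language model sees DNA as ordinary tokens from its first layer onward, whereas the contrastive loss only pulls pooled DNA and text embeddings together, which is the distinction between early integration and late alignment that the abstract draws.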

Problem

Research questions and friction points this paper is trying to address.

modality gap

multimodal integration

heterogeneous inputs

cross-modal reasoning

Multimodal Large Language Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-gap integration

unified tokenizer

native multimodal architecture