Lost in Tokenization: Context as the Key to Unlocking Biomolecular Understanding in Scientific LLMs

📅 2025-10-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Sci-LLMs face a tokenization dilemma when processing raw biomolecular sequences: treating them as linguistic text risks losing functional motifs, while treating them as a standalone modality introduces cross-modal alignment difficulties.
Method: We propose a paradigm shift: replace raw sequence inputs with high-level, structured biological contexts (e.g., secondary structures, evolutionary profiles, functional domains) generated by domain-specific bioinformatics tools, thereby redefining Sci-LLMs as knowledge reasoning engines rather than sequence decoders.
Contribution/Results: Extensive experiments across diverse biological reasoning tasks demonstrate that context-only inputs consistently outperform sequence-only and hybrid inputs; incorporating raw sequences degrades performance, confirming their role as noise under current architectures. This work is the first to empirically establish that structured-context-driven reasoning better aligns with the intrinsic capabilities of Sci-LLMs, offering a biologically grounded paradigm for scientific large language models.
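The three input modes compared above can be illustrated as simple prompt assembly. The sketch below is purely illustrative, not the paper's actual pipeline: the function name, context fields, and toy sequence are assumptions, and the context values stand in for outputs of real bioinformatics tools (secondary-structure predictors, domain scans, evolutionary profiling).

```python
# Illustrative sketch of the paper's three input modes:
# sequence-only, context-only, and hybrid (sequence + context).
# All names and context values are hypothetical stand-ins.

def build_prompt(mode: str, sequence: str, context: dict) -> str:
    """Assemble an LLM prompt for one of the three input modes."""
    question = "What is the likely molecular function of this protein?"
    parts = []
    if mode in ("sequence-only", "hybrid"):
        parts.append(f"Amino-acid sequence: {sequence}")
    if mode in ("context-only", "hybrid"):
        # Structured context, as produced by external bioinformatics tools.
        for key, value in context.items():
            parts.append(f"{key}: {value}")
    parts.append(question)
    return "\n".join(parts)

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy fragment
context = {
    "Secondary structure": "mostly alpha-helical (illustrative)",
    "Functional domain": "kinase-like domain (illustrative)",
    "Evolutionary profile": "highly conserved across bacteria (illustrative)",
}

for mode in ("sequence-only", "context-only", "hybrid"):
    print(f"--- {mode} ---")
    print(build_prompt(mode, sequence, context))
```

Under the paper's findings, the context-only prompt (which omits the raw sequence entirely) is the variant expected to perform best, while adding the sequence back in (the hybrid mode) degrades performance.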

📝 Abstract
Scientific Large Language Models (Sci-LLMs) have emerged as a promising frontier for accelerating biological discovery. However, these models face a fundamental challenge when processing raw biomolecular sequences: the tokenization dilemma. Whether they treat sequences as a specialized language (risking the loss of functional motif information) or as a separate modality (introducing formidable alignment challenges), current strategies fundamentally limit their reasoning capacity. We challenge this sequence-centric paradigm by positing that a more effective strategy is to provide Sci-LLMs with high-level structured context derived from established bioinformatics tools, thereby bypassing the need to interpret low-level, noisy sequence data directly. Through a systematic comparison of leading Sci-LLMs on biological reasoning tasks, we tested three input modes: sequence-only, context-only, and a combination of both. Our findings are striking: the context-only approach consistently and substantially outperforms all other modes. Even more revealing, including the raw sequence alongside its high-level context consistently degrades performance, indicating that raw sequences act as informational noise, even for models with specialized tokenization schemes. These results suggest that the primary strength of existing Sci-LLMs lies not in their nascent ability to interpret biomolecular syntax from scratch, but in their profound capacity for reasoning over structured, human-readable knowledge. Therefore, we argue for reframing Sci-LLMs not as sequence decoders, but as powerful reasoning engines over expert knowledge. This work lays the foundation for a new class of hybrid scientific AI agents, repositioning the developmental focus from direct sequence interpretation towards high-level knowledge synthesis. The code is available at github.com/opendatalab-raise-dev/CoKE.
Problem

Research questions and friction points this paper is trying to address.

Addressing tokenization challenges in biomolecular sequence processing
Enhancing biological reasoning through structured context over raw sequences
Repositioning Sci-LLMs as reasoning engines rather than sequence decoders
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using high-level structured context from bioinformatics tools
Bypassing direct interpretation of low-level sequence data
Reframing Sci-LLMs as reasoning engines over expert knowledge
👥 Authors

Kai Zhuang
Shanghai Artificial Intelligence Laboratory

Jiawei Zhang
Westlake University

Yumou Liu
Shanghai Jiao Tong University

Hanqun Cao
The Chinese University of Hong Kong
Generative Modeling, AI4Science

Chunbin Gu
The Chinese University of Hong Kong

Mengdi Liu
Institute of Computing Technology, Chinese Academy of Sciences
Diffusion models, AI4Science

Zhangyang Gao
Shanghai Artificial Intelligence Laboratory

Zitong Jerry Wang
California Institute of Technology
Computational biology

Xuanhe Zhou
Assistant Professor, Shanghai Jiao Tong University
Data Management, Artificial Intelligence

Pheng-Ann Heng
The Chinese University of Hong Kong

Lijun Wu
Shanghai AI Laboratory
ML, LLM, AI4Science

Conghui He
Shanghai AI Laboratory
Data-centric AI, LLM, Document Intelligence

Cheng Tan
Shanghai Artificial Intelligence Laboratory