Lost in Tokenization: Context as the Key to Unlocking Biomolecular Understanding in Scientific LLMs

📅 2025-10-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Sci-LLMs face a tokenization dilemma when processing raw biomolecular sequences: treating them as linguistic text risks losing functional motifs, while treating them as a standalone modality introduces cross-modal alignment difficulties.
Method: We propose a paradigm shift: replace raw sequence inputs with high-level, structured biological contexts (e.g., secondary structures, evolutionary profiles, functional domains) generated by domain-specific bioinformatics tools, thereby redefining Sci-LLMs as knowledge reasoning engines rather than sequence decoders.
Contribution/Results: Extensive experiments across diverse biological reasoning tasks demonstrate that context-only inputs consistently outperform sequence-only and hybrid inputs; incorporating raw sequences degrades performance, confirming their role as noise under current architectures. This work is the first to empirically establish that structured-context-driven reasoning better aligns with the intrinsic capabilities of Sci-LLMs, offering a biologically grounded paradigm for scientific large language models.
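The three input modes compared above can be illustrated as simple prompt assembly. The sketch below is purely illustrative, not the paper's actual pipeline: the function name, context fields, and toy sequence are assumptions, and the context values stand in for outputs of real bioinformatics tools (secondary-structure predictors, domain scans, evolutionary profiling).

```python
# Illustrative sketch of the paper's three input modes:
# sequence-only, context-only, and hybrid (sequence + context).
# All names and context values are hypothetical stand-ins.

def build_prompt(mode: str, sequence: str, context: dict) -> str:
    """Assemble an LLM prompt for one of the three input modes."""
    question = "What is the likely molecular function of this protein?"
    parts = []
    if mode in ("sequence-only", "hybrid"):
        parts.append(f"Amino-acid sequence: {sequence}")
    if mode in ("context-only", "hybrid"):
        # Structured context, as produced by external bioinformatics tools.
        for key, value in context.items():
            parts.append(f"{key}: {value}")
    parts.append(question)
    return "\n".join(parts)

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy fragment
context = {
    "Secondary structure": "mostly alpha-helical (illustrative)",
    "Functional domain": "kinase-like domain (illustrative)",
    "Evolutionary profile": "highly conserved across bacteria (illustrative)",
}

for mode in ("sequence-only", "context-only", "hybrid"):
    print(f"--- {mode} ---")
    print(build_prompt(mode, sequence, context))
```

Under the paper's findings, the context-only prompt (which omits the raw sequence entirely) is the variant expected to perform best, while adding the sequence back in (the hybrid mode) degrades performance.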

📝 Abstract
Scientific Large Language Models (Sci-LLMs) have emerged as a promising frontier for accelerating biological discovery. However, these models face a fundamental challenge when processing raw biomolecular sequences: the tokenization dilemma. Whether they treat sequences as a specialized language (risking the loss of functional motif information) or as a separate modality (introducing formidable alignment challenges), current strategies fundamentally limit their reasoning capacity. We challenge this sequence-centric paradigm by positing that a more effective strategy is to provide Sci-LLMs with high-level structured context derived from established bioinformatics tools, thereby bypassing the need to interpret low-level, noisy sequence data directly. Through a systematic comparison of leading Sci-LLMs on biological reasoning tasks, we tested three input modes: sequence-only, context-only, and a combination of both. Our findings are striking: the context-only approach consistently and substantially outperforms all other modes. Even more revealing, including the raw sequence alongside its high-level context consistently degrades performance, indicating that raw sequences act as informational noise, even for models with specialized tokenization schemes. These results suggest that the primary strength of existing Sci-LLMs lies not in their nascent ability to interpret biomolecular syntax from scratch, but in their profound capacity for reasoning over structured, human-readable knowledge. Therefore, we argue for reframing Sci-LLMs not as sequence decoders, but as powerful reasoning engines over expert knowledge. This work lays the foundation for a new class of hybrid scientific AI agents, repositioning the developmental focus from direct sequence interpretation towards high-level knowledge synthesis. The code is available at github.com/opendatalab-raise-dev/CoKE.
Problem

Research questions and friction points this paper is trying to address.

Addressing tokenization challenges in biomolecular sequence processing
Enhancing biological reasoning through structured context over raw sequences
Repositioning Sci-LLMs as reasoning engines rather than sequence decoders
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using high-level structured context from bioinformatics tools
Bypassing direct interpretation of low-level sequence data
Reframing Sci-LLMs as reasoning engines over expert knowledge
👥 Authors

Kai Zhuang
Shanghai Artificial Intelligence Laboratory

Jiawei Zhang
Westlake University

Yumou Liu
Shanghai Jiao Tong University

Hanqun Cao
The Chinese University of Hong Kong
Generative Modeling, AI4Science

Chunbin Gu
The Chinese University of Hong Kong

Mengdi Liu
Institute of Computing Technology, Chinese Academy of Sciences
Diffusion models, AI4Science

Zhangyang Gao
Shanghai Artificial Intelligence Laboratory

Zitong Jerry Wang
California Institute of Technology
Computational biology

Xuanhe Zhou
Assistant Professor, Shanghai Jiao Tong University
Data Management, Artificial Intelligence

Pheng-Ann Heng
The Chinese University of Hong Kong

Lijun Wu
Shanghai AI Laboratory
ML, LLM, AI4Science

Conghui He
Shanghai AI Laboratory
Data-centric AI, LLM, Document Intelligence

Cheng Tan
Shanghai Artificial Intelligence Laboratory