Genomic Next-Token Predictors are In-Context Learners

📅 2025-11-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
It remains unclear whether in-context learning (ICL) is a modality-agnostic emergent capability of large-scale sequence modeling, or instead a phenomenon uniquely tied to human language. Method: Leveraging Evo2, a foundation model pretrained on genomic sequences, we design controlled cross-modal experiments that formulate language-like symbolic reasoning tasks over DNA sequences, without any task-specific fine-tuning or prompt engineering. Contribution/Results: We provide the first empirical evidence that ICL emerges spontaneously in genomic foundation models solely through next-nucleotide prediction pretraining. Crucially, ICL performance improves log-linearly with the number of in-context examples, mirroring the scaling behavior observed in large language models. This demonstrates that ICL is a modality-independent emergent property of symbolic sequence modeling, rather than a language-specific artifact. Our findings broaden the understanding of universal intelligence mechanisms in foundation models and establish a novel zero-shot reasoning paradigm for non-linguistic sequential data.

📝 Abstract
In-context learning (ICL) -- the capacity of a model to infer and apply abstract patterns from examples provided within its input -- has been extensively studied in large language models trained for next-token prediction on human text. In fact, prior work often attributes this emergent behavior to distinctive statistical properties in human language. This raises a fundamental question: can ICL arise organically in other sequence domains purely through large-scale predictive training? To explore this, we turn to genomic sequences, an alternative symbolic domain rich in statistical structure. Specifically, we study the Evo2 genomic model, trained predominantly on next-nucleotide (A/T/C/G) prediction, at a scale comparable to mid-sized LLMs. We develop a controlled experimental framework comprising symbolic reasoning tasks instantiated in both linguistic and genomic forms, enabling direct comparison of ICL across genomic and linguistic models. Our results show that genomic models, like their linguistic counterparts, exhibit log-linear gains in pattern induction as the number of in-context demonstrations increases. To the best of our knowledge, this is the first evidence of organically emergent ICL in genomic sequences, supporting the hypothesis that ICL arises as a consequence of large-scale predictive modeling over rich data. These findings extend emergent meta-learning beyond language, pointing toward a unified, modality-agnostic view of in-context learning.
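The abstract's central quantitative claim is that pattern-induction accuracy grows log-linearly with the number of in-context demonstrations. A minimal sketch of what such a fit looks like, using entirely synthetic accuracy numbers (not results from the paper):

```python
# Illustrative sketch: fit accuracy ≈ a + b * log2(shots) by least squares.
# The shot counts and accuracies below are made up for illustration only.
import math

shots = [1, 2, 4, 8, 16, 32]                      # in-context examples
accuracy = [0.31, 0.38, 0.44, 0.52, 0.58, 0.65]   # synthetic, monotone

xs = [math.log2(k) for k in shots]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(accuracy) / n
# Ordinary least-squares slope and intercept on the log2-transformed x-axis.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracy)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x
print(f"fit: accuracy ≈ {a:.3f} + {b:.3f} * log2(shots)")
```

A positive slope `b` on this log-scaled axis is the signature the paper reports for both the genomic and linguistic models.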
Problem

Research questions and friction points this paper is trying to address.

Investigating whether in-context learning emerges in genomic sequence models
Developing comparative framework to test symbolic reasoning across domains
Establishing evidence for modality-agnostic emergence of meta-learning capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Genomic model trained on next-nucleotide prediction
Symbolic reasoning tasks in genomic and linguistic forms
Log-linear gains in pattern induction with demonstrations