Reflection Pretraining Enables Token-Level Self-Correction in Biological Sequence Models

📅 2025-12-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Protein and RNA language models suffer from limited expressiveness in their discrete token spaces, which hinders chain-of-thought (CoT) reasoning. Method: We propose *reflection pretraining*, the first framework to enable CoT in biomolecular sequence modeling: by theoretically characterizing the expressiveness of biological languages, we design an enhanced token space that allows models to generate auxiliary “thinking tokens” for token-level self-correction, bypassing reliance on natural-language intermediate steps. Contribution/Results: This paradigm establishes an intrinsic, sequence-native self-debugging mechanism for non-textual sequences. Experiments demonstrate substantial improvements on downstream tasks, including protein structure prediction and functional annotation, and empirically validate token-level reasoning and correction across multiple benchmarks, with statistically significant gains over standard pretraining.
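
As a back-of-the-envelope illustration of why the token space matters (our toy counting argument, not the paper's formal characterization): with V answer tokens and output length n, a model can realize at most V^n distinct generations, while t extra thinking tokens and a longer trace length m strictly enlarge that space.

```python
# Toy counting argument (illustrative only; all numbers are made up).
V, t = 20, 2      # 20 amino-acid tokens; 2 hypothetical thinking tokens
n, m = 64, 96     # final answer length vs. extended generation trace

answer_only = V ** n            # distinct length-n answer-only generations
with_thinking = (V + t) ** m    # distinct traces with the augmented vocab
print(f"expressiveness ratio: {with_thinking / answer_only:.3e}")  # >> 1
```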

📝 Abstract
Chain-of-Thought (CoT) prompting has significantly advanced task-solving capabilities in natural language processing with large language models. Unlike standard prompting, CoT encourages the model to generate intermediate reasoning steps (non-answer tokens) that help guide the model toward more accurate final outputs. These intermediate steps enable more complex reasoning processes such as error correction, memory management, future planning, and self-reflection. However, applying CoT to non-natural-language domains, such as protein and RNA language models, is not yet possible, primarily due to the limited expressiveness of their token spaces (e.g., amino acid tokens). In this work, we propose and define the concept of language expressiveness: the ability of a given language, using its tokens and grammar, to encode information. We show that the limited expressiveness of the protein language severely restricts the applicability of CoT-style reasoning. To overcome this, we introduce reflection pretraining, for the first time in a biological sequence model, which enables the model to engage in intermediate reasoning through the generation of auxiliary "thinking tokens" beyond simple answer tokens. Theoretically, we demonstrate that our augmented token set significantly enhances biological language expressiveness, thereby improving the overall reasoning capacity of the model. Experimentally, our pretraining approach teaches protein models to self-correct and leads to substantial performance gains compared to standard pretraining.
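
To make the mechanism concrete, here is a minimal sketch of how a backspace-style correction token could be resolved at decoding time. The paper does not publish code; the token name <del> and the resolve function below are illustrative assumptions, not the authors' actual vocabulary or implementation.

```python
# Minimal sketch of token-level self-correction (hypothetical token names;
# the paper does not specify its vocabulary, so <del> is an assumption).
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # standard answer tokens
DELETE = "<del>"                            # auxiliary "thinking" token
VOCAB = AMINO_ACIDS + [DELETE]

def resolve(generated: list[str]) -> str:
    """Collapse a raw generation into the final answer sequence.

    Each <del> retracts the most recent surviving answer token,
    acting as a backspace-style self-correction step.
    """
    stack: list[str] = []
    for tok in generated:
        if tok == DELETE:
            if stack:
                stack.pop()      # retract the previous residue
        else:
            stack.append(tok)
    return "".join(stack)

# The model emits M, K, V, reconsiders V, deletes it, and re-emits L:
assert resolve(["M", "K", "V", DELETE, "L"]) == "MKL"
```

Under this reading, the answer tokens remain ordinary amino acids, and the auxiliary token only edits the partial output, so the final sequence stays within the original alphabet.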
Problem

Research questions and friction points this paper is trying to address.

Biological sequence models lack expressiveness for Chain-of-Thought reasoning
Protein language token spaces limit intermediate reasoning and self-correction
Standard pretraining does not equip models to generate auxiliary thinking tokens for complex reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reflection pretraining enables token-level self-correction
Augmented token set enhances biological language expressiveness
Generates auxiliary thinking tokens for intermediate reasoning (see the pretraining sketch below)
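
A natural way to pretrain for this behavior (a sketch under our assumptions, reusing the hypothetical <del> token from the decoding example above; the paper's actual corruption schedule may differ) is to synthesize training targets that interleave deliberate errors with their retractions:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DELETE = "<del>"  # hypothetical correction token, as in the sketch above

def make_reflective_target(sequence: str, error_rate: float = 0.1,
                           rng: random.Random | None = None) -> list[str]:
    """Build a reflection-pretraining target from a clean sequence.

    With probability error_rate per position, emit a wrong residue
    followed by <del>, then the correct residue, so the model sees
    (and learns to produce) explicit self-correction steps.
    """
    rng = rng or random.Random(0)
    target: list[str] = []
    for residue in sequence:
        if rng.random() < error_rate:
            wrong = rng.choice([a for a in AMINO_ACIDS if a != residue])
            target += [wrong, DELETE]   # deliberate error, then retraction
        target.append(residue)          # the correct residue
    return target

# Prints an interleaved target, e.g. ['M', 'K', 'C', '<del>', 'L', 'V'],
# with the exact positions of the injected errors depending on the draws.
print(make_reflective_target("MKLV", error_rate=0.5))
```

Training on such targets with an ordinary next-token objective would reward the model for recognizing and retracting its own mistakes, which matches the self-correction behavior the paper reports.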
👥 Authors
Xiang Zhang
Fudan University
Jiaqi Wei
PhD student, Zhejiang University
NLP, LLM, AI for Science
Yuejin Yang
Shanghai Artificial Intelligence Laboratory
Zijie Qiu
Fudan University
Yuhan Chen
Shanghai Artificial Intelligence Laboratory
Zhiqiang Gao
Shanghai Artificial Intelligence Laboratory
Muhammad Abdul-Mageed
The University of British Columbia
Natural Language Processing, Deep Learning
Laks V. S. Lakshmanan
University of British Columbia
Wanli Ouyang
The Chinese University of Hong Kong
Chenyu You
Assistant Professor, Stony Brook University
Machine Learning, AI for Health, Computer Vision, Medical Image Analysis, Multimedia
Siqi Sun
Fudan University