Reconstructing Biological Pathways by Applying Selective Incremental Learning to (Very) Small Language Models

📅 2025-07-06
🤖 AI Summary
General-purpose large language models (LLMs) are prone to hallucination and lack the precision required for mechanistic molecular inference in biomedicine. Method: We propose a compact, domain-specialized language model (~110M parameters, BERT architecture) for pathway modeling, combined with an information-entropy-driven active learning strategy that incrementally improves the prediction of molecular regulatory relationships. Focusing on tuberculosis (TB) persistence and transmission pathways, the model is tuned on fewer than 25% of the ~520 manually curated regulatory relationships, with new tuning examples chosen iteratively by favoring incorrectly assigned statements that the model predicted with the highest certainty (lowest entropy). Results: The model achieves over 80% accuracy in proposing TB-related molecular interactions, suggesting that combining compact models, task-specific design, and active learning is a feasible way to fill knowledge gaps in intracellular signaling pathways.

📝 Abstract
The use of generative artificial intelligence (AI) models is becoming ubiquitous in many fields. Though progress continues to be made, general-purpose large language models (LLMs) show a tendency to deliver creative answers, often called "hallucinations", which has slowed their application in the medical and biomedical fields, where accuracy is paramount. We propose that the design and use of much smaller, domain- and even task-specific language models (LMs) may be a more rational and appropriate use of this technology in biomedical research. In this work we apply an LM that is very small by today's standards to the specialized task of predicting regulatory interactions between molecular components, with the aim of filling gaps in our current understanding of intracellular pathways. Toward this end, we attempt to correctly posit known pathway-informed interactions recovered from manually curated pathway databases by selecting and using only the most informative examples as part of an active learning scheme. With this example we show that a small (~110 million parameters) LM based on a Bidirectional Encoder Representations from Transformers (BERT) architecture can propose molecular interactions relevant to tuberculosis persistence and transmission with over 80% accuracy using less than 25% of the ~520 regulatory relationships in question. Using information entropy as a metric for the iterative selection of new tuning examples, we also find that increased accuracy is driven by favoring the use of the incorrectly assigned statements with the highest certainty (lowest entropy). In contrast, the concurrent use of correct but least certain examples contributed little and may even have been detrimental to the learning rate.
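The selection rule described in the abstract (favor incorrectly assigned statements that the model predicted with the lowest entropy) can be illustrated with a minimal Python sketch. This is not the authors' code: the `Prediction` record, the three-class probability vectors, the placeholder statements, and the batch size `k` are all assumptions made for illustration, and Shannon entropy over class probabilities is assumed as the "information entropy" metric.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Prediction:
    statement: str            # candidate regulatory statement (placeholder text)
    probs: Sequence[float]    # model's class probabilities, summing to 1
    true_label: int           # index of the curated reference class


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def is_correct(pred: Prediction) -> bool:
    """True when the most probable class matches the curated label."""
    predicted = max(range(len(pred.probs)), key=lambda i: pred.probs[i])
    return predicted == pred.true_label


def select_confident_errors(preds: List[Prediction], k: int) -> List[Prediction]:
    """Return up to k misclassified examples, lowest entropy (most certain) first."""
    wrong = [p for p in preds if not is_correct(p)]
    wrong.sort(key=lambda p: shannon_entropy(p.probs))
    return wrong[:k]


# Hypothetical pool of scored statements (values invented for illustration).
pool = [
    Prediction("A activates B", [0.90, 0.05, 0.05], true_label=1),  # wrong, very certain
    Prediction("C inhibits D", [0.40, 0.35, 0.25], true_label=2),   # wrong, uncertain
    Prediction("E activates F", [0.80, 0.10, 0.10], true_label=0),  # correct
    Prediction("G inhibits H", [0.05, 0.70, 0.25], true_label=0),   # wrong, fairly certain
]

batch = select_confident_errors(pool, k=2)
print([p.statement for p in batch])  # ['A activates B', 'G inhibits H']
```

In an incremental scheme of this kind, the selected batch would be added to the tuning set, the model re-tuned, and the remaining pool re-scored before the next round. The correctly classified example is never selected, and the uncertain error ranks last among the misclassified ones.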
Problem

Research questions and friction points this paper is trying to address.

Reconstructing biological pathways using small language models
Reducing AI hallucinations in biomedical research applications
Predicting molecular interactions with high accuracy via selective learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses small domain-specific language models
Applies selective incremental learning technique
Leverages entropy for iterative example selection
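
For reference, the entropy-based selection named in the last item can be written compactly. This is a sketch under assumptions not stated on this page: the model outputs a probability vector $p(x)$ over $C$ interaction classes for each candidate statement $x$, $\hat{y}(x)$ is the predicted class, $y(x)$ is the curated class, and $U$ is the pool of statements not yet used for tuning.

```latex
H\bigl(p(x)\bigr) = -\sum_{c=1}^{C} p_c(x)\,\log_2 p_c(x),
\qquad
x^{*} = \operatorname*{arg\,min}_{x \in U,\; \hat{y}(x) \neq y(x)} H\bigl(p(x)\bigr)
```

Low $H$ means the model is highly certain, so $x^{*}$ is the misclassified statement the model was most confident about.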
Pranta Saha
Vaccine and Infectious Disease Organization, University of Saskatchewan
Neeraj Dhar
Vaccine and Infectious Disease Organization, University of Saskatchewan
Joyce Reimer
Vaccine and Infectious Disease Organization, University of Saskatchewan
Jeffrey Chen
Vaccine and Infectious Disease Organization, University of Saskatchewan
Brook Byrns
Advanced Research Computing, University of Saskatchewan
Steven Rayan
Centre for Quantum Topology and Its Applications (quanTA), University of Saskatchewan
Algebraic Geometry, Representation Theory, Mathematical Physics, Quantum Matter, Quantum Computing
Connor Burbridge
Advanced Research Computing, University of Saskatchewan
Gordon Broderick
Vaccine and Infectious Disease Organization, University of Saskatchewan