SynthPert: Enhancing LLM Biological Reasoning via Synthetic Reasoning Traces for Cellular Perturbation Prediction

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Predicting cellular responses to genetic perturbations remains a central challenge in systems biology and therapeutic discovery. Large language models (LLMs) show promise for biological reasoning, but adapting them to structured perturbation data has proved difficult. SynthPert addresses this with supervised fine-tuning on synthetic reasoning traces: a frontier model generates biologically informed reasoning chains, which are quality-filtered and used to distill domain knowledge into the fine-tuned LLM. On the PerturbQA benchmark, the method achieves state-of-the-art performance and surpasses the frontier model that generated its training data. It reaches 87% accuracy on the unseen RPE1 cell line, and the gains persist when using only 2% of quality-filtered training data.

📝 Abstract
Predicting cellular responses to genetic perturbations represents a fundamental challenge in systems biology, critical for advancing therapeutic discovery and virtual cell modeling. While large language models (LLMs) show promise for biological reasoning, their application to perturbation prediction remains underexplored due to challenges in adapting them to structured experimental data. We present SynthPert, a novel method that enhances LLM performance through supervised fine-tuning on synthetic reasoning traces generated by frontier models. Using the PerturbQA benchmark, we demonstrate that our approach not only achieves state-of-the-art performance but surpasses the capabilities of the frontier model that generated the training data. Our results reveal three key insights: (1) Synthetic reasoning traces effectively distill biological knowledge even when partially inaccurate, (2) This approach enables cross-cell-type generalization with 87% accuracy on unseen RPE1 cells, and (3) Performance gains persist despite using only 2% of quality-filtered training data. This work shows the effectiveness of synthetic reasoning distillation for enhancing domain-specific reasoning in LLMs.
Problem

Research questions and friction points this paper is trying to address.

Predicting cellular responses to genetic perturbations in systems biology
Enhancing LLM biological reasoning with synthetic reasoning traces
Overcoming challenges in adapting LLMs to structured experimental data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic reasoning traces enhance LLM biological reasoning
Supervised fine-tuning improves cellular perturbation prediction
Method achieves generalization with minimal quality-filtered data
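The abstract describes the pipeline only at a high level (generate reasoning traces with a frontier model, quality-filter them, then fine-tune). A minimal sketch of what the filtering and formatting steps could look like is given below. Everything in it is an assumption for illustration: the `Trace` fields, the label set (`up`, `down`, `none`), the filter criteria (answer agrees with the measured label, reasoning is not trivially short), and the prompt template are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """One synthetic reasoning trace (hypothetical schema, not the paper's)."""
    perturbed_gene: str
    target_gene: str
    cell_line: str
    reasoning: str        # free-text chain produced by the generator model
    predicted_label: str  # generator's final answer, e.g. "up", "down", "none"
    true_label: str       # label from the perturbation experiment


def filter_traces(traces, min_reasoning_words=20):
    """Keep traces whose answer matches the measured label and whose
    reasoning is at least `min_reasoning_words` words long.

    Both criteria are assumed here; the paper's actual filter may differ.
    """
    kept = []
    for t in traces:
        if t.predicted_label != t.true_label:
            continue  # final answer disagrees with the experiment
        if len(t.reasoning.split()) < min_reasoning_words:
            continue  # reasoning too short to be a useful training signal
        kept.append(t)
    return kept


def to_sft_example(t):
    """Format a kept trace as a prompt/completion pair for fine-tuning."""
    prompt = (
        f"In {t.cell_line} cells, {t.perturbed_gene} is knocked down. "
        f"How does the expression of {t.target_gene} change?"
    )
    completion = f"{t.reasoning}\nAnswer: {t.predicted_label}"
    return {"prompt": prompt, "completion": completion}


# Toy data with made-up gene names, purely to exercise the functions.
EXAMPLE_TRACES = [
    Trace("GENE_A", "GENE_B", "K562",
          "GENE_A represses GENE_B so knockdown should raise GENE_B levels.",
          "up", "up"),
    Trace("GENE_A", "GENE_C", "K562",
          "GENE_A activates GENE_C so knockdown should lower GENE_C levels.",
          "down", "up"),
    Trace("GENE_D", "GENE_E", "K562", "No effect.", "none", "none"),
]

kept_traces = filter_traces(EXAMPLE_TRACES, min_reasoning_words=5)
sft_examples = [to_sft_example(t) for t in kept_traces]
```

In this toy run only the first trace survives: the second is dropped because its answer disagrees with the label, and the third because its reasoning is too short. Note that a filter on the final answer alone does not check the intermediate reasoning, which is consistent with the abstract's remark that traces can be useful even when partially inaccurate.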
Lawrence Phillips
AI and Digital Innovation, Novo Nordisk
Marc Boubnovski Martell
AI Scientist at Novo Nordisk
LLM, Reinforcement Learning, Synthetic Data, Text-to-SQL, 3D vision
Aditya Misra
AI and Digital Innovation, Novo Nordisk
Josefa Lia Stoisser
AI and Digital Innovation, Novo Nordisk
Cesar A. Prada-Medina
AI and Digital Innovation, Novo Nordisk
Rory Donovan-Maiye
Novo Nordisk
Computational Biology, Biophysics, Machine Learning, AI
Kaspar Märtens
Research Scientist at Novo Nordisk
Statistical Machine Learning, Bayesian Statistics, Health, Genomics, #unitartucs