PhyloGen: Language Model-Enhanced Phylogenetic Inference via Graph Structure Generation

📅 2024-12-25
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing phylogenetic inference methods suffer from high computational cost, strong dependence on predefined evolutionary models and multiple sequence alignments (MSAs), and limited accuracy when tree topology and branch lengths are optimized jointly. This paper introduces PhyloGen, the first language-model-driven, end-to-end phylogenetic tree generation framework. It formulates tree construction as a differentiable, conditionally constrained graph-structure generation task, which removes the need for explicit evolutionary models or MSAs. Key contributions include: (1) integrating a pre-trained genomic language model with variational inference; (2) designing a differentiable tree-structure scoring function to improve gradient stability during optimization; and (3) enabling joint optimization of topology and branch lengths. Evaluated on eight real-world benchmark datasets, PhyloGen significantly outperforms state-of-the-art MCMC-based and existing variational inference approaches in both accuracy and robustness. Visualizations further reveal finer-grained evolutionary relationships.

📝 Abstract
Phylogenetic trees elucidate evolutionary relationships among species, but phylogenetic inference remains challenging because it combines continuous parameters (branch lengths) with discrete parameters (tree topology). Traditional Markov Chain Monte Carlo methods suffer from slow convergence and heavy computational burdens. Existing Variational Inference methods, which require pre-generated topologies and typically treat tree structures and branch lengths independently, may overlook critical sequence features, limiting their accuracy and flexibility. We propose PhyloGen, a novel method that leverages a pre-trained genomic language model to generate and optimize phylogenetic trees without dependence on evolutionary models or aligned sequence constraints. PhyloGen views phylogenetic inference as a conditionally constrained tree structure generation problem, jointly optimizing tree topology and branch lengths through three core modules: (i) Feature Extraction, (ii) PhyloTree Construction, and (iii) PhyloTree Structure Modeling. We also introduce a Scoring Function to guide the model towards a more stable gradient descent. We demonstrate the effectiveness and robustness of PhyloGen on eight real-world benchmark datasets. Visualization results confirm that PhyloGen provides deeper insights into phylogenetic relationships.
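
To make the "discrete plus continuous" framing in the abstract concrete, the sketch below shows one simple way a phylogenetic tree can hold both kinds of parameters: the nesting of nodes is the topology (discrete), and each node carries a branch length to its parent (continuous). This is not PhyloGen's implementation; the names `Node`, `to_newick`, and `tree_length` are hypothetical, and the three-taxon tree with its branch lengths is made-up illustrative data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node, for illustration only."""
    name: Optional[str] = None   # leaf label; None for internal nodes
    length: float = 0.0          # continuous parameter: branch length to parent
    children: List["Node"] = field(default_factory=list)  # discrete parameter: topology


def to_newick(node: Node, is_root: bool = True) -> str:
    """Serialize the tree to Newick text (topology plus branch lengths)."""
    if node.children:
        inner = ",".join(to_newick(child, False) for child in node.children)
        label = f"({inner})"
    else:
        label = node.name or ""
    if is_root:
        return label + ";"       # the root has no parent branch
    return f"{label}:{node.length:g}"


def tree_length(node: Node) -> float:
    """Sum of all branch lengths below the given node."""
    return sum(child.length + tree_length(child) for child in node.children)


# Made-up example: taxa A and B are siblings, C is their outgroup.
tree = Node(children=[
    Node(children=[Node("A", 0.1), Node("B", 0.2)], length=0.05),
    Node("C", 0.3),
])

newick = to_newick(tree)
total = tree_length(tree)
print(newick)
print(total)
```

Changing which nodes are nested together alters the topology in a discrete step, while changing any `length` value moves the tree continuously. An inference method has to search over both at once, which is the difficulty the abstract describes.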
Problem

Research questions and friction points this paper is trying to address.

phylogenetic tree computation
computational complexity
resource consumption
Innovation

Methods, ideas, or system contributions that make the work stand out.

PhyloGen
Bioinformatics Language Model
Evolutionary Relationship Optimization