phylo2vec: a library for vector-based phylogenetic tree manipulation

📅 2025-06-24
🤖 AI Summary
The traditional Newick format stores trees as strings of nested parentheses, which is memory-hungry and awkward to manipulate for large collections of phylogenetic trees, and poorly suited to downstream tasks such as machine learning. phylo2vec instead represents a binary phylogenetic tree as an integer vector: a bijection maps any tree topology with n leaves to an integer vector of size n−1, giving a compact, lossless encoding. The representation is designed to enable fast sampling and comparison of binary trees. The core of this release is implemented in Rust for performance and memory efficiency, and it remains available in Python and R through dedicated wrappers, making it accessible to a broad bioinformatics audience.

📝 Abstract
Phylogenetics is a fundamental component of many analysis frameworks in biology as well as linguistics to study the evolutionary relationships of different entities. Recently, the advent of large-scale genomics and the SARS-CoV-2 pandemic have underscored the necessity for phylogenetic software to handle large datasets of genomes or phylogenetic trees. While significant efforts have focused on scaling optimisation algorithms, visualization, and lineage identification, an emerging body of research has been dedicated to efficient representations of data for genomes and phylogenetic trees, such as phylo2vec. Compared to traditional tree representations such as the Newick format, which represents trees using strings of nested parentheses, modern representations of phylogenetic trees utilize integer vectors to define the tree topology traversal. This approach offers several advantages, including easier manipulability, increased memory efficiency, and applicability to downstream tasks such as machine learning. Here, we present the latest release of phylo2vec (or Phylo2Vec), a high-performance software package for encoding, manipulating, and analysing binary phylogenetic trees. At its core, the package is based on the phylo2vec representation of binary trees, which defines a bijection from any tree topology with $n$ leaves into an integer vector of size $n-1$. Compared to the traditional Newick format, phylo2vec is designed to enable fast sampling and comparison of binary trees. This release features a core implementation in Rust, providing significant performance improvements and memory efficiency, while remaining available in Python (superseding the release described in the original paper) and R via dedicated wrappers, making it accessible to a broad audience in the bioinformatics community.
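
The bijection described in the abstract can be sanity-checked with a short counting sketch. It is not the library's code and does not call the phylo2vec API. It assumes the coordinate constraint 0 ≤ v[i] ≤ 2i (0-indexed), which comes from the original Phylo2Vec formulation and is not stated in the abstract above; the function names are hypothetical. Under that assumption, the number of valid vectors of size n−1 equals (2n−3)!!, the number of rooted binary tree topologies on n labelled leaves, and drawing a random tree reduces to drawing each coordinate independently.

```python
import random


def num_rooted_binary_trees(n_leaves: int) -> int:
    """Count rooted binary topologies on n labelled leaves: (2n-3)!! = 1*3*...*(2n-3)."""
    count = 1
    for k in range(3, 2 * n_leaves - 2, 2):
        count *= k
    return count


def num_valid_vectors(n_leaves: int) -> int:
    """Count integer vectors of size n-1 with 0 <= v[i] <= 2*i (assumed constraint)."""
    count = 1
    for i in range(n_leaves - 1):
        count *= 2 * i + 1  # coordinate i can take 2*i + 1 distinct values
    return count


def is_valid_vector(v: list) -> bool:
    """Check the assumed constraint 0 <= v[i] <= 2*i on every coordinate."""
    return all(isinstance(x, int) and 0 <= x <= 2 * i for i, x in enumerate(v))


def sample_tree_vector(n_leaves: int, rng: random.Random = None) -> list:
    """Draw each coordinate independently and uniformly from its allowed range."""
    rng = rng or random.Random()
    return [rng.randint(0, 2 * i) for i in range(n_leaves - 1)]


if __name__ == "__main__":
    for n in range(2, 8):
        # The two counts agree, as a bijection requires.
        print(n, num_rooted_binary_trees(n), num_valid_vectors(n))
    print(sample_tree_vector(6, random.Random(0)))
```

Because every coordinate has its own independent range, sampling costs one random integer per leaf and needs no tree data structure, which is one way to read the abstract's claim of fast sampling.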
Problem

Research questions and friction points this paper is trying to address.

Efficient representation of phylogenetic tree data
Handling large datasets for genomes and trees
Enabling fast tree sampling and comparison
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses integer vectors for tree topology
Enables fast sampling and comparison
Implements the core in Rust for performance and memory efficiency
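
To illustrate why comparison is a friction point for Newick, here is a minimal sketch (not part of phylo2vec; all function names are hypothetical). Two Newick strings can describe the same topology while differing as text, so a comparison has to parse and canonicalise them first. The parser below handles only plain topologies, without branch lengths or quoted labels. With a bijective vector encoding and a fixed leaf labelling, the same check reduces to element-wise equality of two integer vectors.

```python
def parse_newick_topology(newick: str):
    """Parse a Newick string (no branch lengths or quotes) into nested lists of leaf names."""
    s = newick.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if pos < len(s) and s[pos] == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while pos < len(s) and s[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if pos >= len(s) or s[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1  # consume ")"
            # Skip an optional internal-node label; it does not affect topology.
            while pos < len(s) and s[pos] not in ",()":
                pos += 1
            return children
        start = pos
        while pos < len(s) and s[pos] not in ",()":
            pos += 1
        return s[start:pos].strip()  # leaf name

    tree = parse_node()
    if pos != len(s):
        raise ValueError("unexpected trailing characters")
    return tree


def canonical_form(node):
    """Sort children recursively so that child order no longer matters."""
    if isinstance(node, str):
        return node
    return tuple(sorted((canonical_form(child) for child in node), key=repr))


def same_topology(newick_a: str, newick_b: str) -> bool:
    """Compare two Newick strings as unordered rooted trees."""
    return canonical_form(parse_newick_topology(newick_a)) == canonical_form(
        parse_newick_topology(newick_b)
    )


if __name__ == "__main__":
    a = "((A,B),(C,D));"
    b = "((D,C),(B,A));"  # same tree, children written in a different order
    c = "((A,C),(B,D));"  # different tree
    print(a == b, same_topology(a, b), same_topology(a, c))
    # With a bijective vector encoding, comparison is plain equality:
    print([0, 1, 2] == [0, 1, 2])
```

The canonicalisation step is only needed on the string side: plain string equality reports the first pair as different even though the trees match.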
Neil Scheidwasser
University of Copenhagen
Deep learning, speech processing, phylogenetics, animal behavior, public health

Ayush Nag
eScience Institute, University of Washington, Seattle, United States

Matthew J Penn
Data Scientist, The Football Association
Football, Phylogenetics, Epidemiology

Anthony MV Jakob
Independent researcher

Frederik Mølkjær Andersen
PhD Fellow, University of Copenhagen
Applied probability, Mathematical Modeling, Phylogenetics, Infectious Diseases

Mark P Khurana
Section of Health Data Science and AI, University of Copenhagen, Copenhagen, Denmark

Landung Setiawan
eScience Institute, University of Washington, Seattle, United States

Madeline Gordon
eScience Institute, University of Washington, Seattle, United States

David A Duchêne
Section of Health Data Science and AI, University of Copenhagen, Copenhagen, Denmark

Samir Bhatt
Professor of Machine Learning and Public Health, University of Copenhagen
Public Health, Genetics, Infectious Diseases, Machine Learning, Mathematical Biology