Optimal Phylogenetic Reconstruction from Sampled Quartets

📅 2026-04-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem of efficiently reconstructing an unknown phylogenetic tree on n taxa, and predicting unseen quartets, from only Θ(n) randomly sampled quartets whose labels are corrupted by random classification noise. The authors propose an algorithm built on a Quartet-based Embedding and Detection procedure, which uses semidefinite programming to identify and remove well-clustered subtrees of the unknown ground-truth tree. The algorithm achieves a (1−ε)-approximation and, more importantly, recovers a tree close to the true tree in quartet distance, at a sample complexity that matches the Ω(n) lower bound. The main contributions are this reconstruction algorithm and a new Θ(n) bound on the Natarajan dimension of phylogenies. Earlier approximation schemes required either dense instances with Θ(n^4) quartets or Θ(n^2 log n) randomly sampled quartets.

📝 Abstract

Quartet Reconstruction, the task of recovering a phylogenetic tree from smaller trees on four species called \textit{quartets}, is a well-studied problem in theoretical computer science with far-reaching connections to statistics, graph theory, and biology. Given a random sample containing $m$ noisy quartets, labeled by an unknown ground-truth tree $T$ on $n$ taxa, we want to output a tree $\widehat T$ that is \textit{close} to $T$ in terms of quartet distance and can predict unseen quartets. Unfortunately, computing the empirical risk minimizer corresponds to the $\mathsf{NP}$-hard problem of finding a tree that maximizes agreements with the sampled quartets, and earlier works in approximation algorithms gave $(1-\varepsilon)$-approximation schemes (PTAS) for dense instances with $m=\Theta(n^4)$ quartets, or for $m=\Theta(n^2\log n)$ quartets \textit{randomly} sampled from $T$. Prior to our work, it was unknown how many samples are information-theoretically required to learn the tree, and whether there is an efficient reconstruction algorithm. We present optimal results for reconstructing an unknown phylogenetic tree $T$ from a random sample of $m=\Theta(n)$ quartets, corrupted under the Random Classification Noise (RCN) model. This matches the $\Omega(n)$ lower bound required for any meaningful tree reconstruction. Our contribution is twofold: first, we give a tree reconstruction algorithm that not only achieves a $(1-\varepsilon)$-approximation but, most importantly, \textit{recovers} a tree close to $T$ in quartet distance; second, we show a new $\Theta(n)$ bound on the Natarajan dimension of phylogenies (an analog of VC dimension in multiclass classification). Our analysis relies on a new \textit{Quartet-based Embedding and Detection} procedure that identifies and removes well-clustered subtrees from the (unknown) ground-truth $T$ via semidefinite programming.
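
To make the problem setup in the abstract concrete, here is a minimal Python sketch of the input model only. It is not the paper's reconstruction algorithm. It assumes a binary tree given as nested tuples with string leaf names, reads each quartet topology off unit-length path distances (the four-point condition), and models RCN as replacing a label, with probability `eta`, by one of the two wrong topologies chosen uniformly. That noise convention, the normalization of quartet distance to a fraction, and all function names are illustrative assumptions rather than definitions taken from the paper.

```python
import itertools
import random
from collections import deque


def build_adjacency(nested):
    """Turn a nested-tuple binary tree into an adjacency map; leaves are strings."""
    adj = {}
    counter = itertools.count()

    def visit(node):
        if isinstance(node, str):
            adj.setdefault(node, [])
            return node
        name = ("internal", next(counter))
        adj[name] = []
        for child in node:
            c = visit(child)
            adj[name].append(c)
            adj[c].append(name)
        return name

    visit(nested)
    return adj


def leaf_distances(adj):
    """BFS from every leaf; returns the leaf list and hop distances."""
    leaves = [v for v in adj if isinstance(v, str)]
    dist = {}
    for s in leaves:
        seen = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen[w] = seen[u] + 1
                    queue.append(w)
        dist[s] = seen
    return leaves, dist


def quartet_topology(dist, a, b, c, d):
    """Return 0, 1, or 2 for the splits ab|cd, ac|bd, ad|bc."""
    sums = (
        dist[a][b] + dist[c][d],
        dist[a][c] + dist[b][d],
        dist[a][d] + dist[b][c],
    )
    return sums.index(min(sums))  # the true split has the smallest sum


def sample_noisy_quartets(leaves, dist, m, eta, rng):
    """Draw m random quartets labeled by the tree, with assumed RCN noise."""
    sample = []
    for _ in range(m):
        q = tuple(sorted(rng.sample(leaves, 4)))
        label = quartet_topology(dist, *q)
        if rng.random() < eta:  # assumption: flip to a uniformly random wrong label
            label = rng.choice([t for t in range(3) if t != label])
        sample.append((q, label))
    return sample


def quartet_distance(leaves, dist1, dist2):
    """Fraction of all 4-subsets of taxa on which the two trees disagree."""
    quartets = list(itertools.combinations(sorted(leaves), 4))
    disagree = sum(
        quartet_topology(dist1, *q) != quartet_topology(dist2, *q) for q in quartets
    )
    return disagree / len(quartets)


# Hypothetical ground truth T and a candidate T_hat with taxa b and c swapped.
T = ((("a", "b"), ("c", "d")), (("e", "f"), ("g", "h")))
T_hat = ((("a", "c"), ("b", "d")), (("e", "f"), ("g", "h")))

leaves, dist_T = leaf_distances(build_adjacency(T))
_, dist_hat = leaf_distances(build_adjacency(T_hat))

rng = random.Random(0)
sample = sample_noisy_quartets(leaves, dist_T, m=2 * len(leaves), eta=0.1, rng=rng)

agreements = sum(quartet_topology(dist_hat, *q) == label for q, label in sample)
print("empirical agreement of T_hat:", agreements / len(sample))
print("normalized quartet distance:", quartet_distance(leaves, dist_T, dist_hat))
```

The first printed value is the quantity that empirical risk minimization would try to maximize over all trees. The second is the quantity the abstract asks to be small. The sketch only evaluates a given candidate tree and makes no attempt to search for a good one.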

Problem

Research questions and friction points this paper is trying to address.

Phylogenetic Reconstruction

Quartet Reconstruction

Sample Complexity

Random Classification Noise

Systematics

Innovation

Methods, ideas, or system contributions that make the work stand out.

Quartet Reconstruction

Random Classification Noise

Natarajan Dimension