Structure Learning for Directed Trees with Compositional Nodes

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem of learning conditional dependencies among multiple compositional vectors, each constrained to the probability simplex, a setting that existing graphical models do not cover. The authors propose a directed tree learning framework for compositional nodes, which they describe as the first of its kind. The approach uses Kullback–Leibler divergence as the scoring function and models the conditional expectation of each child composition as a mixture of a baseline composition and a parent-driven component parameterized by a column-stochastic transition matrix. This formulation respects the simplex geometry and handles zero-inflated compositions; combined with a non-degeneracy condition on the transition matrix, it makes edge directions identifiable from observational data. The theoretical analysis proves consistency of structure recovery and gives finite-sample guarantees in terms of the signal gap, node dimension, and penalty level. Simulations and applications to multi-site microbiome data and single-cell data yield interpretable directed structures that align with known biological mechanisms.
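
To make the summary concrete, the mixture model and score can be written out under assumed notation. The symbols below (parent composition x_p, child composition x_c, baseline b, mixing weight α, transition matrix A) are illustrative choices, not taken from the paper, and the exact parameterization the authors use may differ.

```latex
% Assumed notation: x_p \in \Delta^{d_p - 1}, x_c \in \Delta^{d_c - 1},
% baseline b \in \Delta^{d_c - 1}, weight \alpha \in [0,1],
% column-stochastic A \in \mathbb{R}_{\ge 0}^{d_c \times d_p} with \mathbf{1}^\top A = \mathbf{1}^\top.
\mu(x_p) \;=\; \mathbb{E}[x_c \mid x_p] \;=\; (1-\alpha)\, b + \alpha\, A x_p

% The mean stays on the simplex: entries are nonnegative and sum to one.
\mathbf{1}^\top \mu(x_p)
  \;=\; (1-\alpha)\,\mathbf{1}^\top b + \alpha\,\mathbf{1}^\top A x_p
  \;=\; (1-\alpha) + \alpha\,\mathbf{1}^\top x_p
  \;=\; 1

% KL score between an observed child composition and its predicted mean.
\mathrm{KL}\big(x_c \,\|\, \mu(x_p)\big)
  \;=\; \sum_{k=1}^{d_c} x_{c,k} \log \frac{x_{c,k}}{\mu_k(x_p)},
  \qquad 0 \log 0 := 0
```

Under this reading, zero entries in the observed child composition contribute nothing to the sum, which is one way a KL-based score can tolerate zero-inflated data without a log-ratio transform.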

📝 Abstract

Compositional data, which are vectors of proportions constrained to the probability simplex, arise frequently in modern scientific applications, including microbiome relative abundances across body sites and cell-type mixture weights derived from single-cell genomics. While regression methods for compositional data are well developed, no existing graphical model framework addresses the problem of learning conditional dependence structures among multiple compositional vectors. This paper introduces a novel framework for directed tree structure learning over compositional nodes. We employ the Kullback-Leibler divergence as the scoring function and model the conditional expectation of each child composition as a mixture of a baseline composition and a parent-driven component parameterized by a column-stochastic transition matrix. This formulation respects the simplex geometry, handles zero-inflated compositions gracefully, and, combined with a non-degeneracy condition on the transition matrix, ensures identifiability of edge directions from observational data. We prove consistency of structure recovery and derive finite-sample guarantees that characterize the required sample size in terms of the signal gap, node dimension, and penalty level. The efficacy of our approach is demonstrated through simulations and applications to multi-site microbiome data and single-cell data, yielding interpretable directed structures that align with known biological mechanisms.
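
The following is a minimal numerical sketch of the mixture mean and KL score described in the abstract, assuming the conditional mean takes the form (1 - alpha) * baseline + alpha * A @ parent. The function names, the weight `alpha`, and the toy values are hypothetical and chosen only for illustration; they are not the authors' implementation, and no tree search or penalty term is included.

```python
import numpy as np


def mixture_mean(parent, A, baseline, alpha):
    """Assumed conditional mean: (1 - alpha) * baseline + alpha * A @ parent.

    A is column-stochastic (nonnegative, each column sums to 1), so A @ parent
    is again a composition, and the convex combination stays on the simplex.
    """
    return (1.0 - alpha) * baseline + alpha * (A @ parent)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) using the convention 0 * log(0 / q) = 0, so zeros in p are allowed."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # skip zero entries of p; they contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))


# Toy values (hypothetical): a 3-part parent with one zero entry.
A = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.3, 0.6]])          # each column sums to 1
baseline = np.array([0.5, 0.3, 0.2])
alpha = 0.6
parent = np.array([0.0, 0.4, 0.6])
child = np.array([0.3, 0.3, 0.4])

pred = mixture_mean(parent, A, baseline, alpha)
score = kl_divergence(child, pred)       # lower means the parent explains the child better

print("predicted child mean:", pred)
print("sum of predicted mean:", pred.sum())
print("KL score:", score)
```

In a structure-learning setting, a score of this kind would be computed for each candidate parent-child pair and fed to a tree search; how the paper fits the parameters and selects the tree is not specified in the abstract.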

Problem

Research questions and friction points this paper is trying to address.

compositional data

directed tree

structure learning

conditional dependence

graphical model

Innovation

Methods, ideas, or system contributions that make the work stand out.

compositional data

directed tree

structure learning