Tree-aggregated regression for compositional data with measurement errors

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a problem with high-dimensional compositional covariates that are aggregated along a tree: aggregation turns leaf-level measurement error into level-dependent, correlated contamination across nodes, which increases estimation bias and destabilizes variable selection. The authors propose TARCO (Tree-Aggregated Regression with Correction for Observation Error), a method that explicitly accounts for the interaction between tree-based aggregation and measurement error. TARCO combines bias-corrected estimating quantities, a tree-aware positive semidefinite stabilization, and sparse regularization to mitigate this hierarchical contamination. On the theory side, the authors establish finite-sample bounds on prediction and estimation error and prove sign consistency under conditions that reflect tree heterogeneity; these guarantees persist when the measurement-error covariance is replaced by a consistent estimator. Computationally, the method is a convex program, with tuning parameters chosen by cross-validation on the corrected objective. Simulations across multiple tree depths and a microbiome application show improved estimation accuracy, support recovery, and interpretability at aggregated taxonomic levels compared with methods that ignore this interaction.
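The three ingredients named in the summary (bias correction, positive semidefinite stabilization, sparse regularization) can be illustrated with a generic errors-in-variables lasso. The sketch below is not TARCO: it assumes plain additive measurement error with a known covariance `Sigma_u`, replaces the tree-aware stabilization with simple eigenvalue clipping, and ignores the compositional constraint. All function names and the toy data are hypothetical.

```python
import numpy as np

def corrected_moments(W, y, Sigma_u):
    """Bias-corrected Gram matrix and cross-moment for W = X + U, Cov(U) = Sigma_u."""
    n = W.shape[0]
    G = W.T @ W / n - Sigma_u  # unbiased for X^T X / n, but may be indefinite
    r = W.T @ y / n            # unbiased for X^T y / n when U is independent of y
    return G, r

def project_psd(G, floor=1e-6):
    """Generic stabilization (assumption): clip eigenvalues from below at `floor`."""
    G = (G + G.T) / 2.0
    vals, vecs = np.linalg.eigh(G)
    return (vecs * np.maximum(vals, floor)) @ vecs.T

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def corrected_lasso(G, r, lam, n_iter=500):
    """Proximal gradient for 0.5 * b^T G b - r^T b + lam * ||b||_1 (convex if G is PSD)."""
    step = 1.0 / np.linalg.eigvalsh(G)[-1]  # 1 / largest eigenvalue
    beta = np.zeros(G.shape[0])
    for _ in range(n_iter):
        grad = G @ beta - r
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Toy data (hypothetical): two active coefficients, noisy covariates.
rng = np.random.default_rng(0)
n, p = 400, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:2] = [2.0, -1.5]
y = X @ beta_true + 0.1 * rng.normal(size=n)
Sigma_u = 0.25 * np.eye(p)                  # assumed known here
W = X + rng.normal(scale=0.5, size=(n, p))  # observed covariates with error

G, r = corrected_moments(W, y, Sigma_u)
beta_hat = corrected_lasso(project_psd(G), r, lam=0.05)
```

The projection step matters because subtracting `Sigma_u` can leave the corrected Gram matrix with negative eigenvalues, in which case the penalized objective is no longer convex. In practice `lam` would be tuned, for example by cross-validation on the corrected objective as the summary describes.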

📝 Abstract

High-dimensional compositional covariates, often derived from count data, are subject to measurement error and are frequently analyzed after aggregation along a prespecified tree to improve interpretability in applications such as microbiome studies. Existing approaches typically handle either tree-guided compositional regression or errors-in-variables correction, but they do not account for the hierarchical contamination induced by their interaction. We show that tree aggregation turns leaf-level measurement error into level-dependent, correlated contamination across aggregated nodes, which inflates bias, weakens concentration rates for corrected estimating quantities, and leads to unstable variable selection for naive approaches. We propose Tree-Aggregated Regression with Correction for Observation Error (TARCO), which integrates bias-corrected estimating quantities with a tree-aware positive semidefinite stabilization and sparse regularization, with tuning selected by cross-validation based on the corrected objective. The resulting convex program can be solved with scalable algorithms. We establish finite-sample bounds for prediction and estimation errors and prove sign consistency under conditions that explicitly reflect tree heterogeneity. The guarantees persist when the measurement-error covariance is replaced by a consistent estimator. Simulations across multiple tree depths and a microbiome application demonstrate improved estimation accuracy, support recovery, and aggregation-level interpretability compared with methods that ignore the interaction between tree aggregation and measurement error.
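The claim that aggregation produces level-dependent, correlated contamination can be seen in a simplified setting. The sketch below assumes additive, independent leaf-level errors with common variance `sigma2` and aggregation by summing leaves under each node; the paper's actual error model for compositional count data may differ. Under these assumptions the error covariance at the node level is `sigma2 * A @ A.T`, where `A` is the node-by-leaf membership matrix. The helper name and the toy tree are hypothetical.

```python
import numpy as np

def aggregation_matrix(parents):
    """Build the node-by-leaf membership matrix of a rooted tree.

    parents[i] is the parent index of node i, or -1 for the root.
    A[j, k] = 1 if the k-th leaf lies in the subtree rooted at node j.
    """
    n_nodes = len(parents)
    has_child = [False] * n_nodes
    for p in parents:
        if p >= 0:
            has_child[p] = True
    leaves = [i for i in range(n_nodes) if not has_child[i]]
    A = np.zeros((n_nodes, len(leaves)))
    for col, leaf in enumerate(leaves):
        node = leaf
        while node >= 0:  # walk from the leaf up to the root
            A[node, col] = 1.0
            node = parents[node]
    return A, leaves

# Toy tree: root 0; internal nodes 1 and 2; leaves 3, 4 under 1 and 5, 6 under 2.
parents = [-1, 0, 0, 1, 1, 2, 2]
A, leaves = aggregation_matrix(parents)

sigma2 = 1.0                      # assumed common leaf-level error variance
error_cov = sigma2 * A @ A.T      # covariance of the aggregated errors

node_variances = np.diag(error_cov)   # grows with the number of leaves under a node
root_child_cov = error_cov[0, 1]      # nested nodes share leaves, so errors correlate
sibling_cov = error_cov[1, 2]         # disjoint subtrees share no leaf errors
```

In this toy model the error variance of a node equals `sigma2` times the number of leaves beneath it, so contamination is heaviest near the root, and any two nodes on the same root-to-leaf path have correlated errors. A diagonal error-covariance correction applied to the aggregated features would therefore miss the off-diagonal terms.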

Problem

Research questions and friction points this paper is trying to address.

compositional data

measurement error

tree aggregation

hierarchical contamination

errors-in-variables

Innovation

Methods, ideas, or system contributions that make the work stand out.

compositional data

measurement error

tree aggregation