🤖 AI Summary
This study addresses nonlinear regression problems in which covariates are organized in a tree-structured hierarchy. It proposes the KR-TEXAS method, which automatically selects the level of feature aggregation to achieve both interpretability and statistical efficiency. KR-TEXAS brings tree-guided adaptive feature aggregation into a nonparametric regression framework, using a Nadaraya–Watson-type estimator with adaptive penalty weights derived from pilot estimates of the regression function's partial derivatives. This approach performs model selection and feature aggregation jointly. Theoretical analysis establishes model selection consistency under mild conditions, and simulation studies show strong performance in both model selection and prediction. The method is applied to a microbiome data set to predict short-chain fatty acid levels.
📝 Abstract
In regression problems where covariates are naturally organized in a hierarchical tree structure, a central challenge is to select the resolution at which covariates enter the model. Determining this level of feature aggregation is of intrinsic scientific interest and can improve statistical efficiency by inducing sparsity. While a rich literature addresses this problem in the linear setting, extending feature aggregation to the nonlinear setting remains an open challenge. In this work, we propose to simultaneously perform model selection and feature aggregation through a penalized Nadaraya–Watson-type estimator. Our proposed estimator, Kernel Regression with Tree-EXploring AggregationS (KR-TEXAS), constructs adaptive penalty weights for the features based on pilot estimators of the regression function's partial derivatives. Under mild conditions, we establish model selection consistency for a well-defined target aggregation set, and our simulations show strong performance in both model selection and prediction. Finally, we demonstrate the utility of our procedure by applying it to a microbiome data set to predict short-chain fatty acids. A user-friendly implementation of our procedure is available in the R package krtexas.
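
To make the two ingredients named in the abstract concrete, the following is a minimal Python sketch of a plain Nadaraya–Watson estimator and of pilot partial-derivative magnitudes turned into adaptive weights. It is not the krtexas implementation and does not reproduce the paper's penalty, tree-guided aggregation step, or optimization, none of which are specified in the abstract. The function names (`nw_predict`, `pilot_derivative_magnitudes`, `adaptive_weights`), the Gaussian product kernel, the central finite-difference scheme, and the inverse-power weight rule with exponent `gamma` are all assumptions chosen for illustration.

```python
import numpy as np


def nw_predict(X_train, y_train, X_query, bandwidths):
    """Nadaraya-Watson estimate using a Gaussian product kernel (assumed kernel)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X_query = np.atleast_2d(np.asarray(X_query, dtype=float))
    h = np.asarray(bandwidths, dtype=float)
    # Bandwidth-scaled differences, shape (n_query, n_train, n_features).
    diff = (X_query[:, None, :] - X_train[None, :, :]) / h
    log_w = -0.5 * np.sum(diff ** 2, axis=2)
    # Subtract the row-wise maximum before exponentiating for numerical stability.
    log_w -= log_w.max(axis=1, keepdims=True)
    w = np.exp(log_w)
    # Kernel-weighted average of the training responses.
    return (w @ y_train) / w.sum(axis=1)


def pilot_derivative_magnitudes(X_train, y_train, bandwidths, eps=1e-2):
    """Mean absolute central finite difference of the NW fit, one value per feature.

    Illustrative pilot estimator only; the paper's pilot estimator may differ.
    """
    X = np.asarray(X_train, dtype=float)
    p = X.shape[1]
    mags = np.empty(p)
    for j in range(p):
        step = np.zeros(p)
        step[j] = eps
        up = nw_predict(X, y_train, X + step, bandwidths)
        down = nw_predict(X, y_train, X - step, bandwidths)
        mags[j] = np.mean(np.abs(up - down) / (2 * eps))
    return mags


def adaptive_weights(mags, gamma=1.0, floor=1e-8):
    """Hypothetical weight rule: a larger estimated derivative gives a smaller weight."""
    return 1.0 / np.maximum(mags, floor) ** gamma


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(300, 2))
    # The response depends on the first feature only.
    y = np.sin(2 * X[:, 0]) + 0.05 * rng.standard_normal(300)
    h = np.array([0.3, 0.3])
    mags = pilot_derivative_magnitudes(X, y, h)
    print("derivative magnitudes:", mags)
    print("adaptive weights:", adaptive_weights(mags))
```

In this toy setup the response varies only with the first feature, so its estimated derivative magnitude should dominate and its weight should be the smaller of the two. In an adaptive-penalty scheme, a small weight lets a feature stay in the model cheaply, while a large weight pushes an apparently irrelevant feature toward removal.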