Distributional Random Forests for Complex Survey Designs on Reproducing Kernel Hilbert Spaces

📅 2025-12-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses model-free estimation of the conditional distribution $P(Y \mid X = \mathbf{x})$ and its continuous functionals $\Psi(\cdot)$, where the response $Y$ takes values in a locally compact Polish space (e.g., multivariate continuous or mixed-type responses), under complex survey designs. We propose the first distributional random forest for complex surveys, featuring: (i) an MMD-based splitting criterion using Hájek-weighted node distributions; (ii) PSU-level honest partitioning; and (iii) design-weighted kernel mean embeddings. We establish consistency for both finite-population and superpopulation inference targets. Combined with a pseudo-population bootstrap that exploits the two-stage clustered sampling structure, the method estimates tolerance regions for the conditional joint distribution of diabetes biomarkers in NHANES data, uncovering heterogeneous risk patterns across population subgroups. Simulation studies show that ignoring the survey design substantially increases statistical error.

📝 Abstract

We study estimation of the conditional law $P(Y \mid X=\mathbf{x})$ and continuous functionals $\Psi(P(Y \mid X=\mathbf{x}))$ when $Y$ takes values in a locally compact Polish space, $X \in \mathbb{R}^p$, and the observations arise from a complex survey design. We propose a survey-calibrated distributional random forest (SDRF) that incorporates complex-design features via a pseudo-population bootstrap, PSU-level honesty, and a Maximum Mean Discrepancy (MMD) split criterion computed from kernel mean embeddings of Hájek-type (design-weighted) node distributions. We provide a framework for analyzing forest-style estimators under survey designs, and we establish design consistency for the finite-population target and model consistency for the super-population target under explicit conditions on the design, kernel, resampling multipliers, and tree partitions. As far as we are aware, these are the first results on model-free estimation of conditional distributions under survey designs. Simulations under a stratified two-stage cluster design assess finite-sample performance and quantify the statistical error incurred by ignoring the survey design. The broad applicability of SDRF is demonstrated using NHANES: we estimate tolerance regions of the conditional joint distribution of two diabetes biomarkers, illustrating how distributional heterogeneity can support subgroup-specific risk profiling for diabetes mellitus in the U.S. population.
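
To make the MMD split criterion concrete, the following is a minimal sketch of a design-weighted squared MMD between the response distributions of two candidate child nodes. It is not the paper's implementation: the Gaussian kernel, the `bandwidth` parameter, the V-statistic form, the weighted-share scaling in `split_score`, and all function names are assumptions made for illustration. The only ingredients taken from the abstract are Hájek-type normalization of the design weights and the comparison of kernel mean embeddings.

```python
import numpy as np


def gaussian_gram(a, b, bandwidth=1.0):
    """Gram matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))


def hajek_weights(w):
    """Normalize design weights so they sum to one (Hajek-type normalization)."""
    w = np.asarray(w, dtype=float)
    return w / w.sum()


def weighted_mmd2(y_left, w_left, y_right, w_right, bandwidth=1.0):
    """Squared MMD between the weighted empirical laws of two nodes.

    Uses ||mu_L - mu_R||^2 = <mu_L, mu_L> + <mu_R, mu_R> - 2 <mu_L, mu_R>,
    where mu_L and mu_R are the weighted kernel mean embeddings.
    """
    a = hajek_weights(w_left)
    b = hajek_weights(w_right)
    k_ll = a @ gaussian_gram(y_left, y_left, bandwidth) @ a
    k_rr = b @ gaussian_gram(y_right, y_right, bandwidth) @ b
    k_lr = a @ gaussian_gram(y_left, y_right, bandwidth) @ b
    return float(k_ll + k_rr - 2.0 * k_lr)


def split_score(y_left, w_left, y_right, w_right, bandwidth=1.0):
    """Assumed score: product of weighted node shares times squared MMD."""
    total = np.sum(w_left) + np.sum(w_right)
    share_left = np.sum(w_left) / total
    share_right = np.sum(w_right) / total
    return share_left * share_right * weighted_mmd2(
        y_left, w_left, y_right, w_right, bandwidth
    )
```

Under these assumptions, a tree builder would evaluate `split_score` for each candidate split and keep the split with the largest value. Because the weights are normalized within each node, rescaling all weights in one node leaves `weighted_mmd2` unchanged.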

Problem

Research questions and friction points this paper is trying to address.

Estimates conditional distributions under complex survey designs.

Proposes a survey-calibrated random forest with design features.

Analyzes the consistency of forest-style estimators under survey designs.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Survey-calibrated distributional random forest with pseudo-population bootstrap

PSU-level honesty and an MMD split criterion computed from kernel mean embeddings

Design consistency framework for forest estimators under complex surveys
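
As a rough illustration of what PSU-level honesty could look like in practice, the sketch below assigns whole primary sampling units (PSUs) either to the subsample used to choose splits or to the subsample used to estimate leaf distributions, so that no PSU contributes to both. The function name `psu_honest_split`, the `split_fraction` parameter, and the simple random assignment are hypothetical. The summary does not describe the paper's exact procedure, and a real stratified design would likely need the assignment done within strata, which this sketch does not handle.

```python
import random
from collections import defaultdict


def psu_honest_split(psu_ids, split_fraction=0.5, seed=0):
    """Partition row indices into (split_rows, estimate_rows) by whole PSU.

    psu_ids: sequence giving the PSU label of each observation.
    All rows that share a PSU land in the same half.
    """
    rows_by_psu = defaultdict(list)
    for row, psu in enumerate(psu_ids):
        rows_by_psu[psu].append(row)

    psus = sorted(rows_by_psu)
    if len(psus) < 2:
        raise ValueError("need at least two PSUs for an honest split")

    rng = random.Random(seed)
    rng.shuffle(psus)

    # Keep at least one PSU in each half.
    n_split = int(round(split_fraction * len(psus)))
    n_split = min(max(n_split, 1), len(psus) - 1)
    split_psus = set(psus[:n_split])

    split_rows, estimate_rows = [], []
    for psu, rows in rows_by_psu.items():
        target = split_rows if psu in split_psus else estimate_rows
        target.extend(rows)
    return sorted(split_rows), sorted(estimate_rows)
```

Splitting at the PSU level rather than the observation level keeps correlated observations from the same cluster out of both halves, which is the property that honesty relies on under clustered sampling.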