Pre-validation Revisited

📅 2025-05-21

📈 Citations: 0

✨ Influential: 0

career value

166K/year

🤖 AI Summary

This paper addresses the problem that pre-validation statistics—used for inference with high-dimensional heterogeneous features (e.g., two feature sets differing markedly in dimensionality and with unknown inter-feature dependencies)—exhibit asymptotic distributions deviating from standard normality, leading to invalid hypothesis tests. We propose a novel theoretical and inferential framework that dispenses with the conventional assumption of feature independence. For the first time, we derive the analytical asymptotic distribution of pre-validation statistics under general non-independence conditions and develop a universal nonparametric bootstrap procedure for robust inference. Our approach simultaneously improves prediction accuracy, error estimation fidelity, and hypothesis testing reliability. Empirical validation on a real breast cancer cohort and simulated genome-wide association studies (GWAS) demonstrates: (i) over 40% reduction in error estimation bias; (ii) confidence interval coverage converging to nominal levels; and (iii) substantial improvement in p-value calibration.

Technology Category

Application Category

📝 Abstract

Pre-validation is a way to build prediction model with two datasets of significantly different feature dimensions. Previous work showed that the asymptotic distribution of test statistic for the pre-validated predictor deviated from a standard Normal, hence will lead to issues in hypothesis tests. In this paper, we revisited the pre-validation procedure and extended the problem formulation without any independence assumption on the two feature sets. We proposed not only an analytical distribution of the test statistics for pre-validated predictor under certain models, but also a generic bootstrap procedure to conduct inference. We showed properties and benefits of pre-validation in prediction, inference and error estimation by simulation and various applications, including analysis of a breast cancer study and a synthetic GWAS example.

Problem

Research questions and friction points this paper is trying to address.

Extends pre-validation without independence assumptions on feature sets

Provides analytical distribution and bootstrap for test statistics

Demonstrates pre-validation benefits in prediction and inference

Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends pre-validation without independence assumptions

Proposes analytical distribution for test statistics

Introduces generic bootstrap inference procedure

🔎 Similar Papers

Distributional bias compromises leave-one-out cross-validation