🤖 AI Summary
This paper investigates model selection consistency of best subset selection (BSS) in high-dimensional sparse linear regression. To address the insufficiency of existing theory in characterizing BSS performance, the authors formally introduce two novel complexity measures—"residual signal complexity" and "pseudo-projection operator complexity"—and establish that their joint behavior with the identifiability margin constitutes a necessary and sufficient margin condition for BSS consistency. Crucially, this condition depends solely on these three quantities, bypassing conventional reliance on signal-to-noise ratio or stringent design matrix assumptions (e.g., restricted eigenvalue conditions), and partially extends to generalized linear models. The analysis integrates high-dimensional statistical inference, projection operator theory, and model selection frameworks to deliver an exact characterization of BSS consistency. This result substantially broadens the theoretical applicability beyond convex alternatives such as LASSO and provides deeper insight into the intrinsic advantages of BSS.
📝 Abstract
We consider the problem of best subset selection (BSS) under the high-dimensional sparse linear regression model. Recently, Guo et al. (2020) showed that the model selection performance of BSS depends on a certain identifiability margin, a measure of the model-discriminative power of BSS that remains robust to design dependence under a general correlation structure, unlike its computational surrogates such as LASSO, SCAD, and MCP. Expanding on this, we further broaden the theoretical understanding of best subset selection and show that the complexities of the residualized signals (the portions of the signals orthogonal to the true active features) and of the spurious projections (the projection operators associated with the irrelevant features) also play fundamental roles in characterizing the margin condition for model consistency of BSS. In particular, we establish both necessary and sufficient margin conditions depending only on the identifiability margin and these two complexity measures. We also partially extend our sufficiency result to high-dimensional sparse generalized linear models (GLMs).
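To make the object of study concrete, the procedure analyzed above can be sketched as follows: best subset selection exhaustively fits ordinary least squares on every candidate support of a given size and keeps the one with the smallest residual sum of squares. This is a minimal illustrative sketch only (the function and variable names are ours, not the paper's), and the exhaustive search is exponential in the number of features, which is why the computational surrogates mentioned above exist.

```python
# Minimal sketch of best subset selection (BSS) for sparse linear
# regression. Illustrative only; names are not from the paper.
import itertools
import numpy as np

def best_subset(X, y, s):
    """Return the size-s support minimizing the residual sum of squares."""
    n, p = X.shape
    best_support, best_rss = None, np.inf
    # Enumerate all candidate supports of size s (exponential in p).
    for support in itertools.combinations(range(p), s):
        Xs = X[:, support]
        # Least-squares fit restricted to the candidate support.
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = np.sum((y - Xs @ beta) ** 2)
        if rss < best_rss:
            best_support, best_rss = support, rss
    return best_support

# Toy example: true support {0, 3} with strong signals and small noise,
# an easy regime where BSS is expected to recover the true model.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.standard_normal(100)
print(best_subset(X, y, 2))
```

The margin condition in the paper governs precisely when the true support wins this residual-sum-of-squares comparison against every spurious support of the same size.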