Collapsing Categories for Regression with Mixed Predictors

📅 2025-11-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Excessive categorical predictors in regression models often cause information fragmentation, degrading estimation accuracy. To address this, we propose an adaptive category-merging method based on pairwise vector fusion LASSO, which jointly clusters categories exhibiting similar response patterns, thereby reducing model complexity. The method operates within a general loss-function framework, ensuring compatibility with diverse regression settings—including linear, logistic, and Poisson regression. We employ an inexact proximal gradient descent algorithm to guarantee computational feasibility and theoretical convergence. Extensive simulations and empirical analysis on real-world Spotify data demonstrate that our approach substantially reduces the number of categories while improving average predictive accuracy by 12.7%. Notably, this work constitutes the first systematic application of vector fusion LASSO to categorical merging, offering both theoretical rigor—via established statistical consistency properties—and practical effectiveness across heterogeneous regression tasks.

Technology Category

Application Category

📝 Abstract
Categorical predictors are omnipresent in everyday regression practice: in fact, most regression data involve some categorical predictors, and this tendency is increasing in modern applications with more complex structures and larger data sizes. However, including too many categories in a regression model would seriously hamper accuracy, as the information in the data is fragmented by the multitude of categories. In this paper, we introduce a systematic method to reduce the complexity of categorical predictors by adaptively collapsing categories in regressions, so as to enhance the performance of regression estimation. Our method is based on the {em pairwise vector fused LASSO}, which automatically fuses the categories that bear a similar regression relation with the response. We develop our method under a wide class of regression models defined by a general loss function, which includes linear models and generalized linear models as special cases. We rigorously established the category collapsing consistency of our method, developed an Inexact Proximal Gradient Descent algorithm to implement it, and proved the feasibility and convergence of our algorithm. Through simulations and an application to Spotify music data, we demonstrate that our method can effectively reduce categorical complexity while improving prediction performance, making it a powerful tool for regression with mixed predictors.
Problem

Research questions and friction points this paper is trying to address.

Reduces categorical predictor complexity through adaptive collapsing
Enhances regression accuracy by fusing similar response-related categories
Applies to general regression models with mixed predictor types
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pairwise vector fused LASSO for category fusion
Adaptive collapsing of similar response categories
General loss function framework for mixed predictors
🔎 Similar Papers
No similar papers found.
C
Chaegeun Song
Department of Statistics, The Pennsylvania State University
Z
Zhong Zheng
Department of Statistics, The Pennsylvania State University
B
Bing Li
Department of Statistics, The Pennsylvania State University
Lingzhou Xue
Lingzhou Xue
Professor of Statistics, The Pennsylvania State University
High Dimensional StatisticsStatistical LearningStatistical Network AnalysisNonconvex OptimizationData Science