Modeling with Categorical Features via Exact Fusion and Sparsity Regularisation

📅 2026-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses parameter inflation and overfitting in high-dimensional linear regression with multi-level categorical predictors. The authors propose a joint regularization method that simultaneously induces clustering among category-specific coefficients and enforces sparsity, thereby reducing model complexity. A mixed-integer programming framework is introduced to obtain exact solutions, complemented by an exact dynamic-programming algorithm for the univariate subproblem and a fast block coordinate descent algorithm for approximate large-scale estimation. Theoretical analysis and experiments on synthetic and real-world datasets show that the proposed approach outperforms existing methods in both prediction accuracy and recovery of the true coefficient clustering structure.
📝 Abstract
We study the high-dimensional linear regression problem with categorical predictors that have many levels. We propose a new estimation approach, which performs model compression via two mechanisms by simultaneously encouraging (a) clustering of the regression coefficients to collapse some of the categorical levels together; and (b) sparsity of the regression coefficients. We present novel mixed integer programming formulations for our estimator, and develop a custom row generation procedure to speed up exact off-the-shelf solvers. We also propose a fast approximate algorithm for our method that obtains high-quality feasible solutions via block coordinate descent. As the main building block of our algorithm, we develop an exact algorithm for the univariate case based on dynamic programming, which may be of independent interest. We establish new theoretical guarantees for both the prediction and the cluster recovery performance of our estimator. Our numerical experiments on synthetic and real datasets demonstrate that our proposed estimator tends to outperform the state-of-the-art.
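To make the estimator's two mechanisms concrete, here is a minimal brute-force sketch of the univariate subproblem: fit one coefficient per category level, while penalizing the number of distinct coefficient values (clustering) and the number of nonzero values (sparsity). This is an illustration only, not the paper's dynamic program or MIP formulation; the function names (`exact_univariate`, `fit_partition`), the L0-style penalty form, and the weights `lam`/`gam` are assumptions made for the sketch.

```python
import itertools
import numpy as np

def fit_partition(y, levels, groups, zero_group):
    # Given a partition of levels into groups, fit each group's shared
    # coefficient (its within-group mean of y), forcing zero_group to 0.
    beta, rss = {}, 0.0
    for g, members in groups.items():
        mask = np.isin(levels, members)
        b = 0.0 if g == zero_group else y[mask].mean()
        for m in members:
            beta[m] = b
        rss += ((y[mask] - b) ** 2).sum()
    return rss, beta

def exact_univariate(y, levels, lam=1.0, gam=0.5):
    # Brute-force search over all partitions of the levels (exponential in
    # the number of levels; the paper's dynamic program avoids this).
    # Objective: RSS + lam * (#clusters) + gam * (#nonzero clusters).
    uniq = sorted(set(levels))
    L = len(uniq)
    best = None
    for assign in itertools.product(range(L), repeat=L):
        groups = {}
        for lvl, g in zip(uniq, assign):
            groups.setdefault(g, []).append(lvl)
        # Try each group as the zero cluster, and the no-zero-cluster case.
        for zg in list(groups) + [None]:
            rss, beta = fit_partition(y, levels, groups, zg)
            k = len(groups)
            nz = k - (1 if zg is not None else 0)
            obj = rss + lam * k + gam * nz
            if best is None or obj < best[0]:
                best = (obj, beta)
    return best

# With a clustering penalty, levels whose responses look alike collapse
# to a single shared coefficient:
y = np.array([1.0, 1.0, 1.0, 1.0, 5.0, 5.0])
levels = np.array(["a", "a", "b", "b", "c", "c"])
obj, beta = exact_univariate(y, levels, lam=0.1, gam=0.0)
# levels "a" and "b" merge into one cluster; "c" stays separate
```

Because every partition is enumerated, the sketch is only usable for a handful of levels; it serves to show what "collapsing levels" and "sparsity of coefficients" mean, not how the paper scales to many levels.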
Problem

Research questions and friction points this paper is trying to address.

categorical features
high-dimensional regression
model compression
clustering
sparsity
Innovation

Methods, ideas, or system contributions that make the work stand out.

categorical features
model compression
mixed integer programming
sparsity regularization
dynamic programming