🤖 AI Summary
Deep neural networks often underperform tree-based models on tabular data due to irrelevant features, feature heterogeneity, and local irregularities. To address this, this work proposes the LassoFlexNet architecture, built around five inductive biases: robustness to irrelevant features, axis alignment, localized irregularities, feature heterogeneity, and training stability. It employs per-feature embeddings to separately capture each input's linear and nonlinear marginal contributions, integrates a Tied Group Lasso penalty for sparse variable selection, and introduces a novel Sequential Hierarchical Proximal Adaptive Gradient optimizer with exponential moving averages (EMA) to keep training stable. This design breaks the undesirable rotational invariance inherent in standard neural networks. Evaluated across 52 benchmark datasets, LassoFlexNet matches or surpasses leading tree-based models, achieving up to a 10% relative performance gain while retaining Lasso-like interpretability.
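The summary's "per-feature embeddings" idea — scoring each input's linear and nonlinear marginal contribution separately and summing over features — can be illustrated with a minimal NumPy sketch. This is a generic additive-per-feature forward pass under our own assumptions (ReLU, one tiny 1→h→1 network per feature), not the paper's actual LassoFlexNet implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def per_feature_forward(x, W_lin, W_emb, b_emb, w_out):
    """Sum of per-feature linear and nonlinear contributions.

    x: (n, d). Each feature j has its own linear weight W_lin[j] and its
    own tiny 1 -> h -> 1 ReLU network (W_emb[j], b_emb[j], w_out[j]), so
    contributions are axis-aligned rather than rotation-invariant.
    """
    linear = x * W_lin                                       # (n, d)
    hidden = np.maximum(0.0, x[:, :, None] * W_emb + b_emb)  # (n, d, h)
    nonlinear = np.einsum('ndh,dh->nd', hidden, w_out)       # (n, d)
    return (linear + nonlinear).sum(axis=1)                  # (n,)

d, h = 4, 8
x = rng.normal(size=(16, d))
W_lin = rng.normal(size=d)
W_emb = rng.normal(size=(d, h))
b_emb = rng.normal(size=(d, h))
w_out = rng.normal(size=(d, h))
y = per_feature_forward(x, W_lin, W_emb, b_emb, w_out)
print(y.shape)  # one scalar prediction per row
```

Because the model is additive over features, zeroing all of feature j's parameters removes exactly that feature's contribution — which is what makes a group-wise sparsity penalty a natural fit for variable selection here.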
📝 Abstract
Despite their dominance in vision and language, deep neural networks often underperform relative to tree-based models on tabular data. To bridge this gap, we incorporate five key inductive biases into deep learning: robustness to irrelevant features, axis alignment, localized irregularities, feature heterogeneity, and training stability. We propose \emph{LassoFlexNet}, an architecture that evaluates the linear and nonlinear marginal contributions of each input via Per-Feature Embeddings, and sparsely selects relevant variables using a Tied Group Lasso mechanism. Because these components introduce optimization challenges that destabilize standard proximal methods, we develop a \emph{Sequential Hierarchical Proximal Adaptive Gradient optimizer with exponential moving averages (EMA)} to ensure stable convergence. Across $52$ datasets from three benchmarks, LassoFlexNet matches or outperforms leading tree-based models, achieving up to a $10$\% relative gain, while maintaining Lasso-like interpretability. We substantiate these empirical results with ablation studies and theoretical proofs confirming the architecture's enhanced expressivity and structural breaking of undesired rotational invariance.
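The sparse variable selection the abstract describes is built on group-Lasso-style penalties, whose standard proximal update is block soft-thresholding: each feature's parameter group is shrunk toward zero and dropped entirely once its norm falls below the threshold. The sketch below shows only that textbook step — it is not the paper's Tied Group Lasso mechanism or its Sequential Hierarchical optimizer, just the building block such proximal methods rest on:

```python
import numpy as np

def group_lasso_prox(W, lam):
    """Proximal operator of lam * sum_j ||W[j]||_2 (one group per row).

    Each row is rescaled by max(0, 1 - lam / ||row||): rows with norm
    below lam are zeroed exactly, deselecting that feature wholesale.
    """
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return W * scale

W = np.array([[3.0, 4.0],    # norm 5.0 -> shrunk by factor 0.8, kept
              [0.3, 0.4]])   # norm 0.5 < lam -> zeroed out
print(group_lasso_prox(W, 1.0))  # [[2.4, 3.2], [0.0, 0.0]]
```

The exact zeros are what give the method its Lasso-like interpretability: surviving groups correspond directly to the selected features.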