🤖 AI Summary
This paper addresses the sharp accuracy loss that initialization-based pruning methods suffer when searching for “winning tickets” at high sparsity. It proposes Concrete Ticket Search (CTS), which formulates subnetwork discovery as a holistic combinatorial optimization problem. CTS uses a Concrete relaxation of the discrete search space, introduces GRADBALANCE, a gradient balancing scheme that controls sparsity, and adopts a pruning objective (CTS-KL) that minimizes the reverse KL divergence between sparse and dense network outputs. CTS finds subnetworks near initialization without sensitive hyperparameter tuning. With ResNet-20 on CIFAR-10, it reaches 74.0% accuracy at 99.3% sparsity in 7.9 minutes, whereas LTR attains 68.3% accuracy at the same sparsity in 95.2 minutes; CTS also outperforms saliency-based pruning-at-initialization methods across all sparsities. The key contributions are: (i) a knowledge distillation-inspired family of pruning objectives for winning ticket search, and (ii) the GRADBALANCE gradient balancing scheme.
📝 Abstract
The Lottery Ticket Hypothesis asserts the existence of highly sparse, trainable subnetworks ('winning tickets') within dense, randomly initialized neural networks. However, state-of-the-art methods of drawing these tickets, like Lottery Ticket Rewinding (LTR), are computationally prohibitive, while more efficient saliency-based Pruning-at-Initialization (PaI) techniques suffer from a significant accuracy-sparsity trade-off and fail basic sanity checks. In this work, we argue that PaI's reliance on first-order saliency metrics, which ignore inter-weight dependencies, contributes substantially to this performance gap, especially in the sparse regime. To address this, we introduce Concrete Ticket Search (CTS), an algorithm that frames subnetwork discovery as a holistic combinatorial optimization problem. By leveraging a Concrete relaxation of the discrete search space and a novel gradient balancing scheme (GRADBALANCE) to control sparsity, CTS efficiently identifies high-performing subnetworks near initialization without requiring sensitive hyperparameter tuning. Motivated by recent works on lottery ticket training dynamics, we further propose a knowledge distillation-inspired family of pruning objectives, finding that minimizing the reverse Kullback-Leibler divergence between sparse and dense network outputs (CTS-KL) is particularly effective. Experiments on various image classification tasks show that CTS produces subnetworks that robustly pass sanity checks and achieve accuracy comparable to or exceeding LTR, while requiring only a small fraction of the computation. For example, with ResNet-20 on CIFAR-10, CTS reaches 99.3% sparsity with 74.0% accuracy in 7.9 minutes, while LTR attains the same sparsity with 68.3% accuracy in 95.2 minutes. CTS's subnetworks outperform saliency-based methods across all sparsities, and their advantage over LTR is most pronounced in the highly sparse regime.
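To make the two central ingredients concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard binary Concrete (relaxed Bernoulli) parameterization for a per-weight keep-mask, and it assumes that "reverse KL" means KL(p_sparse || p_dense) with the dense network in the teacher role. The toy linear layer and all names (`sample_concrete_mask`, `reverse_kl`, `mask_logits`, and so on) are hypothetical. GRADBALANCE is not sketched, since the abstract does not specify how it works.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sample_concrete_mask(mask_logits, temperature=0.5, rng=random):
    """Draw a relaxed (binary Concrete) keep-mask, one value per weight."""
    mask = []
    for a in mask_logits:
        u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)  # avoid log(0)
        logistic_noise = math.log(u) - math.log(1.0 - u)
        mask.append(sigmoid((a + logistic_noise) / temperature))
    return mask


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(sparse_logits, dense_logits):
    """KL(p_sparse || p_dense); the direction is an assumption of this sketch."""
    log_q = log_softmax(sparse_logits)
    log_p = log_softmax(dense_logits)
    return sum(math.exp(lq) * (lq - lp) for lq, lp in zip(log_q, log_p))


def linear_logits(weights, x):
    # One linear layer: each row of `weights` produces one class logit.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


# Toy "network": one linear layer with 2 classes and 3 input features.
rng = random.Random(0)
weights = [[0.5, -1.0, 2.0], [1.5, 0.3, -0.7]]
mask_logits = [[2.0, -2.0, 2.0], [2.0, -2.0, 2.0]]  # hypothetical learnable parameters
x = [1.0, 2.0, -1.0]

masks = [sample_concrete_mask(row, rng=rng) for row in mask_logits]
sparse_weights = [
    [w * m for w, m in zip(w_row, m_row)]
    for w_row, m_row in zip(weights, masks)
]
loss = reverse_kl(linear_logits(sparse_weights, x), linear_logits(weights, x))
print(f"relaxed mask: {masks}")
print(f"reverse KL to dense outputs: {loss:.4f}")
```

The sketch only evaluates the forward pass. In an actual search, the mask logits would presumably be updated by gradient descent through the relaxed samples (using an autodiff framework), and the mask would be thresholded to a hard binary mask once the target sparsity is reached.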