🤖 AI Summary
Deep neural networks suffer from excessive computational overhead and energy consumption due to over-parameterization, hindering deployment on resource-constrained devices. To address this, we propose an optimal-transport-based, fine-tuning-free layer-compression method: redundant intermediate layers are removed outright by minimizing the Max-Sliced Wasserstein Distance (MSWD) between the feature distributions of adjacent layers. To our knowledge, this is the first work to use the MSWD as a regularization objective for layer compression, eliminating the need for retraining, pruning, or knowledge distillation. Evaluated on image classification tasks, our approach fully removes multiple intermediate layers with <0.5% accuracy degradation, significantly reduces FLOPs, and preserves end-to-end inference consistency. Our key contributions are: (i) a theoretically grounded, near-lossless layer-collapse framework; (ii) zero-shot, fine-tuning-free compression; and (iii) efficient modeling of structural redundancy in deep networks.
📝 Abstract
Deep neural networks are well known for their remarkable performance on complex tasks, but their appetite for computational resources remains a significant hurdle: it raises energy-consumption concerns and restricts deployment on resource-constrained devices, stalling their widespread adoption. In this paper, we present an optimal-transport method to reduce the depth of over-parametrized deep neural networks, alleviating their computational burden. More specifically, we propose a new regularization strategy based on the Max-Sliced Wasserstein distance that minimizes the distance between intermediate feature distributions in the network. We show that minimizing this distance enables the complete removal of intermediate layers, with almost no performance loss and without any fine-tuning. We assess the effectiveness of our method on standard image classification setups. We will release the source code upon acceptance of the article.
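As a rough illustration of the quantity being minimized (not the authors' implementation, whose details are not given here), the Max-Sliced Wasserstein distance between two equal-sized batches of intermediate features can be approximated by projecting both batches onto candidate unit directions and taking the direction with the largest 1-D Wasserstein distance. The function names and the random-search strategy below are illustrative assumptions; in practice the maximizing direction is often found by gradient ascent instead.

```python
import numpy as np

def sliced_w1(x, y, theta):
    # Project both feature batches (n_samples, dim) onto the unit
    # direction theta, then compute the 1-D Wasserstein-1 distance
    # between the projections via their sorted (empirical) quantiles.
    px = np.sort(x @ theta)
    py = np.sort(y @ theta)
    return np.abs(px - py).mean()

def max_sliced_wasserstein(x, y, n_candidates=512, seed=0):
    # Approximate the max-sliced W1 by searching over random unit
    # directions (illustrative; a learned/optimized direction is
    # the more common choice in practice).
    rng = np.random.default_rng(seed)
    thetas = rng.normal(size=(n_candidates, x.shape[1]))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    return max(sliced_w1(x, y, t) for t in thetas)
```

Used as a regularizer, this distance would be evaluated between the feature distributions of adjacent layers; once it is driven near zero, the intervening layer approximates an identity map and can be removed without fine-tuning.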