🤖 AI Summary
Deploying large deep models (e.g., BERT, ResNet) on resource-constrained edge devices remains challenging. To address this, we propose a causally grounded knowledge distillation method based on Random Matrix Theory (RMT). Unlike conventional pruning or heuristic low-rank approximations, our approach mathematically identifies and preserves information-rich principal directions by analyzing the spectral distribution of hidden-layer representations, enabling layer-wise causal structural compression. Integrated with self-distillation, it jointly enforces inter-layer causal reduction and representation stability. On multiple benchmark tasks, the compressed models achieve an 80% parameter reduction with only a 2% accuracy drop, a 2.8× inference speedup, and a 47% reduction in power consumption. This work is the first to systematically incorporate RMT into knowledge distillation, establishing a theoretically rigorous, interpretable, and causally principled paradigm for model compression.
📝 Abstract
Large deep learning models such as BERT and ResNet achieve state-of-the-art performance but are costly to deploy at the edge due to their size and compute demands. We present RMT-KD, a compression method that leverages Random Matrix Theory (RMT) for knowledge distillation to iteratively reduce network size. Instead of pruning or heuristic rank selection, RMT-KD preserves only informative directions identified via the spectral properties of hidden representations. RMT-based causal reduction is applied layer by layer with self-distillation to maintain stability and accuracy. On GLUE, AG News, and CIFAR-10, RMT-KD achieves up to 80% parameter reduction with only 2% accuracy loss, delivering 2.8× faster inference and nearly halved power consumption. These results establish RMT-KD as a mathematically grounded approach to network distillation.
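The core idea of selecting informative directions from the spectrum of hidden representations can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `rmt_rank`, the median-based noise-scale estimate, and the use of the Marchenko-Pastur bulk edge as the signal/noise cutoff are all illustrative assumptions, showing one common RMT-style rank-selection recipe.

```python
import numpy as np

def rmt_rank(H):
    """Estimate the number of informative directions in a hidden
    representation matrix H (n samples x d features).

    Eigenvalues of the sample covariance are compared against the
    Marchenko-Pastur bulk edge expected under pure noise; directions
    whose eigenvalues exceed the edge are treated as signal.
    """
    n, d = H.shape
    Hc = H - H.mean(axis=0, keepdims=True)   # center each feature
    cov = Hc.T @ Hc / n                      # d x d sample covariance
    eigvals = np.linalg.eigvalsh(cov)
    sigma2 = np.median(eigvals)              # crude noise-variance estimate
    gamma = d / n                            # aspect ratio of H
    lam_plus = sigma2 * (1.0 + np.sqrt(gamma)) ** 2  # MP upper edge
    return int(np.sum(eigvals > lam_plus))
```

On a matrix that is a strong rank-r signal plus i.i.d. noise, the count returned is close to r; a layer could then be projected onto its top `rmt_rank(H)` principal directions before distillation.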