UnIT: Scalable Unstructured Inference-Time Pruning for MAC-efficient Neural Inference on MCUs

📅 2025-07-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing pruning methods predominantly rely on structured sparsity and training- or compilation-time optimizations, limiting their ability to exploit fine-grained computational redundancy on microcontrollers (MCUs) that lack SIMD support. This work proposes UnIT, the first retraining-free, unstructured, inference-time pruning framework tailored to MCUs. Its core is an activation-driven dynamic skipping mechanism: leveraging input-dependent activation patterns, it replaces multiplications with lightweight threshold comparisons, augmented by three embedded-friendly fast division approximations and a threshold-reuse strategy, enabling per-MAC pruning under irregular sparsity. Evaluated on the MSP430 MCU, UnIT reduces MAC operations by up to 82.03%, accelerates inference by up to 84.19%, and lowers energy consumption by up to 84.38%, with only 0.48–7% accuracy degradation. Notably, it matches or outperforms retraining-based baselines even under domain shift.
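The per-MAC skipping mechanism described above can be sketched in C. This is a minimal illustration, not the paper's implementation: the pruning rule (skip when the activation magnitude falls below a threshold), the `pruned_dot` helper, and the int16 fixed-point types are all illustrative assumptions.

```c
#include <stdint.h>

/*
 * Illustrative sketch of activation-driven MAC skipping: each multiply
 * in a dot product is gated by a cheap threshold comparison on the
 * activation, so MACs with near-zero inputs are skipped entirely.
 * The criterion |x| < threshold stands in for UnIT's actual rule.
 */
int32_t pruned_dot(const int16_t *x, const int16_t *w, int n,
                   int16_t threshold, int *macs_skipped)
{
    int32_t acc = 0;
    *macs_skipped = 0;
    for (int i = 0; i < n; i++) {
        /* Lightweight comparison replaces the multiply when the
         * activation magnitude is below the layer's threshold. */
        if (x[i] > -threshold && x[i] < threshold) {
            (*macs_skipped)++;
            continue; /* MAC skipped */
        }
        acc += (int32_t)x[i] * (int32_t)w[i];
    }
    return acc;
}
```

Because the comparison is far cheaper than a software multiply on a divider-less, SIMD-less core, skipped MACs translate directly into cycle and energy savings.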

📝 Abstract
Existing pruning methods are typically applied at training or compile time and often rely on structured sparsity. While compatible with low-power microcontrollers (MCUs), structured pruning underutilizes the opportunity for fine-grained efficiency on devices without SIMD support or parallel compute. To address these limitations, we introduce UnIT (Unstructured Inference-Time pruning), a lightweight method that dynamically identifies and skips unnecessary multiply-accumulate (MAC) operations during inference, guided by input-specific activation patterns. Unlike structured pruning, UnIT embraces irregular sparsity and requires neither retraining nor hardware specialization. It transforms pruning decisions into lightweight comparisons, replacing multiplications with threshold checks and approximated divisions. UnIT further optimizes compute by reusing threshold computations across multiple connections and applying layer- and group-specific pruning sensitivity. We present three fast, hardware-friendly division approximations tailored to the capabilities of common embedded platforms. Demonstrated on the MSP430 microcontroller, UnIT achieves 11.02% to 82.03% MAC reduction, 27.30% to 84.19% faster inference, and 27.33% to 84.38% lower energy consumption compared to training-time pruned models, while keeping accuracy within 0.48–7% of baseline. Under domain shift, UnIT matches or exceeds the accuracy of retrained models while requiring significantly fewer MACs. These results establish unstructured inference-time pruning as a viable and practical solution for efficient, retraining-free deployment of deep neural networks on MCUs.
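The abstract's "hardware-friendly division approximations" matter because cores like the MSP430 have no hardware divider, so a true divide is a slow software routine. One common embedded pattern, shown here as a hedged sketch, is to replace division by a known denominator with a precomputed fixed-point reciprocal and a shift; the paper's three specific approximations are not reproduced here, and `fast_div_init`/`fast_div` are hypothetical names.

```c
#include <stdint.h>

/* Precomputed fixed-point reciprocal: recip = round(2^16 / d). */
typedef struct {
    int32_t recip;
} fast_div_t;

fast_div_t fast_div_init(int32_t d)
{
    fast_div_t f;
    f.recip = (int32_t)((((int64_t)1 << 16) + d / 2) / d); /* done once */
    return f;
}

/* Approximate x / d as (x * recip) >> 16: one multiply and one shift,
 * avoiding the MCU's slow software division entirely. */
int32_t fast_div(int32_t x, fast_div_t f)
{
    return (int32_t)(((int64_t)x * f.recip) >> 16);
}
```

The one-time reciprocal computation pays for itself when the same denominator (e.g. a weight magnitude used in a threshold) is divided into many values, which is exactly the reuse pattern the abstract describes.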
Problem

Research questions and friction points this paper is trying to address.

Enabling fine-grained pruning during neural network inference on MCUs
Reducing multiply-accumulate (MAC) operations without retraining or hardware changes
Improving energy efficiency and speed while maintaining model accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic MAC skipping via input-specific activation patterns
Lightweight threshold comparisons replace multiplications
Hardware-friendly division approximations for embedded platforms
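The threshold-reuse idea from the abstract can be illustrated by one plausible reading (an assumption, not the paper's stated formula): if a MAC is skippable when |x·w| < ε, then the activation threshold is ε/|w|, and computing it per connection would be expensive. Reusing one threshold per weight group, derived here from the group's largest |w| so that skipping stays conservative, amortizes that cost; the group size, the `sensitivity` parameter, and the formula are all illustrative.

```c
#include <stdint.h>

/*
 * Hypothetical sketch of threshold reuse: compute one pruning threshold
 * per group of weights instead of per connection. Using the group's
 * maximum |w| yields the smallest (most conservative) threshold, so no
 * connection in the group is over-pruned.
 */
void reuse_group_thresholds(const int16_t *w, int n, int group_size,
                            int32_t sensitivity, int16_t *thr_out)
{
    int groups = (n + group_size - 1) / group_size;
    for (int g = 0; g < groups; g++) {
        int start = g * group_size;
        int end = (start + group_size < n) ? start + group_size : n;
        int32_t max_abs = 0; /* largest |w| in this group */
        for (int i = start; i < end; i++) {
            int32_t a = (w[i] < 0) ? -(int32_t)w[i] : w[i];
            if (a > max_abs) max_abs = a;
        }
        /* One shared threshold per group; the division here is where the
         * fast division approximations would be applied in practice. */
        thr_out[g] = (int16_t)(sensitivity / (max_abs ? max_abs : 1));
    }
}
```

Group-level reuse also gives a natural place to apply the layer- and group-specific pruning sensitivity the abstract mentions: each group's `sensitivity` can be tuned independently.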