HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers

πŸ“… 2026-03-12
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Vision Transformers (ViTs) are challenging to deploy on edge devices due to their high computational and memory demands, while existing structured pruning methods suffer from coarse or fixed granularity and rely on multi-stage pipelines. This work proposes HiAP, a hierarchical automated pruning framework that jointly optimizes both macro-level structures (e.g., attention heads and FFN blocks) and micro-level components (e.g., intra-head dimensions and FFN neurons) within a single-stage, end-to-end training process. HiAP employs Gumbel-Sigmoid stochastic gating and continuous relaxation to automatically search for efficient subnetworks without requiring manually specified sparsity targets or heuristic rules. By incorporating structural feasibility constraints and an analytical FLOPs loss, HiAP achieves Pareto-optimal accuracy-efficiency trade-offs on ImageNet for models such as DeiT-Small, matching the performance of complex multi-stage approaches while significantly simplifying deployment.
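The Gumbel-Sigmoid stochastic gating mentioned above can be sketched in NumPy. This is not the paper's code: the gate counts (6 heads, 64 intra-head dimensions, matching DeiT-Small's layout) and the coupling of micro-gates to their parent macro-gate are illustrative assumptions.

```python
import numpy as np

def gumbel_sigmoid(logits, tau=1.0, rng=None):
    """Sample relaxed binary gates via the Gumbel-Sigmoid trick.

    Adding Logistic noise (the difference of two Gumbel variables) to the
    logits and squashing with a temperature-scaled sigmoid yields a
    differentiable, stochastic relaxation of Bernoulli keep/prune gates.
    """
    rng = np.random.default_rng(rng)
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=np.shape(logits))
    logistic_noise = np.log(u) - np.log(1.0 - u)  # Gumbel(0,1) - Gumbel(0,1)
    return 1.0 / (1.0 + np.exp(-(np.asarray(logits) + logistic_noise) / tau))

# Hypothetical macro-gates over 6 attention heads and micro-gates over
# 64 intra-head dimensions (shapes illustrative, not from the paper).
head_logits = np.zeros(6)           # logit 0 -> ~0.5 keep probability
dim_logits = np.zeros((6, 64))
head_gates = gumbel_sigmoid(head_logits, tau=0.5, rng=0)
dim_gates = gumbel_sigmoid(dim_logits, tau=0.5, rng=1)
# A micro unit survives only if its parent head survives: multiply gates.
effective = head_gates[:, None] * dim_gates
print(effective.shape)  # (6, 64)
```

Lower temperatures push the sampled gates toward hard 0/1 decisions while keeping gradients flowing to the logits, which is what lets the search run in a single end-to-end training phase.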

πŸ“ Abstract
Vision Transformers require significant computational resources and memory bandwidth, severely limiting their deployment on edge devices. While recent structured pruning methods successfully reduce theoretical FLOPs, they typically operate at a single structural granularity and rely on complex, multi-stage pipelines with post-hoc thresholding to satisfy sparsity budgets. In this paper, we propose Hierarchical Auto-Pruning (HiAP), a continuous relaxation framework that discovers optimal sub-networks in a single end-to-end training phase without requiring manual importance heuristics or predefined per-layer sparsity targets. HiAP introduces stochastic Gumbel-Sigmoid gates at multiple granularities: macro-gates to prune entire attention heads and FFN blocks, and micro-gates to selectively prune intra-head dimensions and FFN neurons. By optimizing both levels simultaneously, HiAP addresses both the memory-bound overhead of loading large matrices and the compute-bound mathematical operations. HiAP naturally converges to stable sub-networks using a loss function that incorporates both structural feasibility penalties and analytical FLOPs. Extensive experiments on ImageNet demonstrate that HiAP organically discovers highly efficient architectures, and achieves a competitive accuracy-efficiency Pareto frontier for models like DeiT-Small, matching the performance of sophisticated multi-stage methods while significantly simplifying the deployment pipeline.
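The analytical FLOPs loss from the abstract can be illustrated as a closed-form, differentiable FLOPs estimate written in terms of expected gate values. The formulas below are standard transformer FLOPs counts under assumed DeiT-Small dimensions (d_model=384, 6 heads of width 64, FFN hidden size 1536), not the paper's exact loss.

```python
import numpy as np

def expected_block_flops(p_head, p_dim, p_neuron, seq_len=197, d_model=384):
    """Expected FLOPs of one transformer block under stochastic gates.

    p_head, p_dim, p_neuron are keep probabilities for heads, intra-head
    dimensions, and FFN neurons. Because the estimate is a smooth function
    of these probabilities, it can be added directly to the training loss.
    """
    # Expected surviving q/k/v width: a dim counts only if its head does.
    kept_dims = np.sum(p_head[:, None] * p_dim)
    kept_neurons = np.sum(p_neuron)                 # FFN hidden width
    qkv = 3 * 2 * seq_len * d_model * kept_dims     # Q, K, V projections
    attn = 2 * 2 * seq_len * seq_len * kept_dims    # scores + weighted sum
    proj = 2 * seq_len * kept_dims * d_model        # output projection
    ffn = 2 * 2 * seq_len * d_model * kept_neurons  # two FFN matmuls
    return qkv + attn + proj + ffn

# Dense model vs. a uniform 50% keep probability at every granularity.
dense = expected_block_flops(np.ones(6), np.ones((6, 64)), np.ones(1536))
sparse = expected_block_flops(np.full(6, 0.5), np.full((6, 64), 0.5),
                              np.full(1536, 0.5))
print(sparse / dense)  # between 0.25 (attention scaling) and 0.5 (FFN)
```

Note how the multi-granular coupling compounds: halving both head and intra-head keep probabilities quarters the expected attention width, while the FFN term only halves, so the optimizer can trade the two off against accuracy.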
Problem

Research questions and friction points this paper is trying to address.

Vision Transformers
structured pruning
computational resources
memory bandwidth
edge devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-granular pruning
stochastic gating
end-to-end auto-pruning
Vision Transformers
Gumbel-Sigmoid
πŸ”Ž Similar Papers
No similar papers found.
Authors

Andy Li — Monash University
Aiden Durrant — School of Natural and Computing Sciences, University of Aberdeen, Aberdeen, UK; School of Computing Sciences, University of East Anglia, Norwich, UK
Milan Markovic — Interdisciplinary Fellow in Data & AI, University of Aberdeen, UK
Georgios Leontidis — School of Natural and Computing Sciences, University of Aberdeen, Aberdeen, UK; Department of Physics and Technology, UiT The Arctic University of Norway, Tromsø, Norway