🤖 AI Summary
Existing weapon detection methods rely predominantly on object detection and therefore lack pixel-level, fine-grained segmentation capability, while semantic segmentation models often struggle to balance accuracy against efficiency, hindering deployment on resource-constrained edge devices. To address this, we propose ArmFormer, a lightweight Transformer architecture that integrates the Convolutional Block Attention Module (CBAM) with MixVisionTransformer to construct an attention-enhanced encoder paired with a "hamburger-style" decoder, enabling precise pixel-wise segmentation and classification of multiple weapon categories. With only 3.66M parameters and 4.886G FLOPs, ArmFormer achieves 80.64% mIoU and 89.13% mFscore on a five-class weapon dataset while sustaining 82.26 FPS inference. It outperforms heavyweight models that require up to 48× more computation in both accuracy and efficiency, establishing a practical foundation for real-time, edge-deployable security monitoring.
📝 Abstract
The escalating threat of weapon-related violence necessitates automated detection systems capable of pixel-level precision for accurate threat assessment in real-time security applications. Traditional weapon detection approaches rely on object detection frameworks that provide only coarse bounding-box localizations, lacking the fine-grained segmentation required for comprehensive threat analysis. Furthermore, existing semantic segmentation models either sacrifice accuracy for computational efficiency or demand computational resources incompatible with edge deployment scenarios. This paper presents ArmFormer, a lightweight transformer-based semantic segmentation framework that strategically integrates the Convolutional Block Attention Module (CBAM) with the MixVisionTransformer architecture to achieve superior accuracy while maintaining computational efficiency suitable for resource-constrained edge devices. Our approach combines a CBAM-enhanced encoder backbone with an attention-integrated hamburger decoder to enable multi-class weapon segmentation across five categories: handgun, rifle, knife, revolver, and human. Comprehensive experiments demonstrate that ArmFormer achieves state-of-the-art performance with 80.64% mIoU and 89.13% mFscore while maintaining real-time inference at 82.26 FPS. With only 4.886G FLOPs and 3.66M parameters, ArmFormer outperforms heavyweight models requiring up to 48× more computation, establishing it as a strong candidate for deployment on portable security cameras, surveillance drones, and embedded AI accelerators in distributed security infrastructure.
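To make the attention mechanism concrete, the following is a minimal NumPy sketch of CBAM's two sequential stages, channel attention (a shared MLP over globally average- and max-pooled features) followed by spatial attention (a convolution over channel-wise average and max maps). The weight shapes, reduction ratio, and random initialization here are illustrative assumptions, not the paper's actual configuration, and a real encoder would apply this inside each backbone stage with learned weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """Channel attention: shared two-layer MLP over global avg- and max-pooled
    descriptors. x: (C, H, W); w1: (C//r, C); w2: (C, C//r)."""
    avg = x.mean(axis=(1, 2))                      # (C,) global average pool
    mx = x.max(axis=(1, 2))                        # (C,) global max pool
    att = sigmoid(w2 @ np.maximum(w1 @ avg, 0.0)   # MLP(avg) + MLP(max)
                  + w2 @ np.maximum(w1 @ mx, 0.0))
    return x * att[:, None, None]                  # rescale each channel

def spatial_attention(x, k):
    """Spatial attention: 'same' convolution over stacked channel-wise
    avg and max maps. x: (C, H, W); k: (2, kh, kw) kernel."""
    stacked = np.stack([x.mean(axis=0), x.max(axis=0)])  # (2, H, W)
    kh, kw = k.shape[1:]
    padded = np.pad(stacked, ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2)))
    H, W = x.shape[1:]
    out = np.empty((H, W))
    for i in range(H):                             # naive sliding-window conv
        for j in range(W):
            out[i, j] = np.sum(padded[:, i:i + kh, j:j + kw] * k)
    return x * sigmoid(out)[None, :, :]            # rescale each location

def cbam(x, w1, w2, k):
    """CBAM block: channel attention, then spatial attention."""
    return spatial_attention(channel_attention(x, w1, w2), k)

# Toy feature map: 8 channels, 6x6 spatial grid, reduction ratio r = 2.
rng = np.random.default_rng(0)
C, H, W, r = 8, 6, 6, 2
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
k = rng.standard_normal((2, 3, 3)) * 0.1
y = cbam(x, w1, w2, k)
print(y.shape)  # same shape as the input: (8, 6, 6)
```

Because both attention maps are sigmoid-gated, the block is a pure refinement: it rescales features without changing their shape, which is what makes it cheap to insert throughout an encoder backbone.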