ArmFormer: Lightweight Transformer Architecture for Real-Time Multi-Class Weapon Segmentation and Classification

📅 2025-10-19
🤖 AI Summary
Existing weapon detection methods rely predominantly on object detection and lack pixel-level, fine-grained segmentation; meanwhile, semantic segmentation models often struggle to balance accuracy and efficiency, hindering deployment on resource-constrained edge devices. To address this, the authors propose ArmFormer, a lightweight transformer architecture that integrates the Convolutional Block Attention Module (CBAM) with MixVision Transformer to build an attention-enhanced encoder and a novel "hamburger-style" decoder, enabling precise pixel-wise segmentation and classification of multiple weapon categories. With only 3.66M parameters and 4.886G FLOPs, ArmFormer achieves 80.64% mIoU and 89.13% mFscore on a five-class weapon dataset while reaching 82.26 FPS inference on standard hardware. It outperforms heavyweight models that require up to 48× more computation in both accuracy and efficiency, establishing a new paradigm for real-time, edge-deployable security monitoring.

📝 Abstract
The escalating threat of weapon-related violence necessitates automated detection systems capable of pixel-level precision for accurate threat assessment in real-time security applications. Traditional weapon detection approaches rely on object detection frameworks that provide only coarse bounding box localizations, lacking the fine-grained segmentation required for comprehensive threat analysis. Furthermore, existing semantic segmentation models either sacrifice accuracy for computational efficiency or require excessive computational resources incompatible with edge deployment scenarios. This paper presents ArmFormer, a lightweight transformer-based semantic segmentation framework that strategically integrates Convolutional Block Attention Module (CBAM) with MixVisionTransformer architecture to achieve superior accuracy while maintaining computational efficiency suitable for resource-constrained edge devices. Our approach combines CBAM-enhanced encoder backbone with attention-integrated hamburger decoder to enable multi-class weapon segmentation across five categories: handgun, rifle, knife, revolver, and human. Comprehensive experiments demonstrate that ArmFormer achieves state-of-the-art performance with 80.64% mIoU and 89.13% mFscore while maintaining real-time inference at 82.26 FPS. With only 4.886G FLOPs and 3.66M parameters, ArmFormer outperforms heavyweight models requiring up to 48x more computation, establishing it as the optimal solution for deployment on portable security cameras, surveillance drones, and embedded AI accelerators in distributed security infrastructure.
Problem

Research questions and friction points this paper is trying to address.

Achieving pixel-level weapon segmentation for real-time threat assessment
Overcoming computational inefficiency of existing semantic segmentation models
Enabling multi-class weapon classification on resource-constrained edge devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight transformer integrates CBAM with MixVisionTransformer
Combines CBAM-enhanced encoder with attention-integrated hamburger decoder
Achieves real-time multi-class weapon segmentation on edge devices
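The CBAM attention named above can be sketched without any deep-learning framework. A minimal NumPy version for a single image tensor is shown below; the MLP weights `w1`/`w2` and the simplified spatial gate (the paper's CBAM uses a 7×7 convolution there) are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """Gate each channel of x (shape (C, H, W)) using a shared MLP
    applied to both average- and max-pooled channel descriptors."""
    avg = x.mean(axis=(1, 2))  # (C,) average-pooled descriptor
    mx = x.max(axis=(1, 2))    # (C,) max-pooled descriptor
    att = sigmoid(w2 @ np.maximum(w1 @ avg, 0.0)
                  + w2 @ np.maximum(w1 @ mx, 0.0))  # (C,) gate in (0, 1)
    return x * att[:, None, None]

def spatial_attention(x):
    """Gate each spatial location using channel-wise avg and max maps.
    Stand-in for CBAM's 7x7 conv over the stacked pooled maps."""
    avg = x.mean(axis=0)       # (H, W)
    mx = x.max(axis=0)         # (H, W)
    att = sigmoid(avg + mx)    # (H, W) gate in (0, 1)
    return x * att[None, :, :]

def cbam(x, w1, w2):
    """CBAM ordering: channel attention first, then spatial attention."""
    return spatial_attention(channel_attention(x, w1, w2))
```

Because both gates are sigmoid-valued, the module only rescales features (each output magnitude is bounded by the input magnitude), which is what lets it be inserted into an existing encoder backbone without disrupting training.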
Akhila Kambhatla
School of Computing, Southern Illinois University, Carbondale, IL, USA
Taminul Islam
Research Assistant, Southern Illinois University Carbondale
Computer Vision · Deep Learning · Machine Learning · Object Detection · Segmentation
Khaled R Ahmed
School of Computing, Southern Illinois University, Carbondale, IL, USA