UAVD-Mamba: Deformable Token Fusion Vision Mamba for Multimodal UAV Detection

📅 2025-07-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address severe occlusion, dense small objects, and irregular target shapes in UAV-based object detection, this paper proposes UAVD-Mamba, a multimodal detection framework built on the Mamba architecture. It introduces a Deformable Token Mamba Block (DTMB) that combines adaptive patches from deformable convolutions with normal convolutional patches to improve geometric adaptability; uses separate DTMBs for the RGB and infrared modalities, whose outputs feed a Fusion Mamba Block to exploit modality complementarity; applies cross-enhanced spatial attention before the DTMB and cross-channel attention after fusion to sharpen feature discrimination; and designs a YOLO-inspired Detection Neck for Mamba (DNM) to handle multiscale features. Evaluated on the DroneVehicle dataset, UAVD-Mamba outperforms the OAFA baseline by 3.6% mAP, with substantial gains in small-object detection. The results demonstrate the framework's robustness and the effectiveness of its multimodal fusion strategy.

📝 Abstract
Unmanned Aerial Vehicle (UAV) object detection has been widely used in traffic management, agriculture, emergency rescue, etc. However, it faces significant challenges, including occlusions, small object sizes, and irregular shapes. These challenges highlight the necessity for a robust and efficient multimodal UAV object detection method. Mamba has demonstrated considerable potential in multimodal image fusion. Leveraging this, we propose UAVD-Mamba, a multimodal UAV object detection framework based on Mamba architectures. To improve geometric adaptability, we propose the Deformable Token Mamba Block (DTMB) to generate deformable tokens by incorporating adaptive patches from deformable convolutions alongside normal patches from normal convolutions, which serve as the inputs to the Mamba Block. To optimize the multimodal feature complementarity, we design two separate DTMBs for the RGB and infrared (IR) modalities, with the outputs from both DTMBs integrated into the Mamba Block for feature extraction and into the Fusion Mamba Block for feature fusion. Additionally, to improve multiscale object detection, especially for small objects, we stack four DTMBs at different scales to produce multiscale feature representations, which are then sent to the Detection Neck for Mamba (DNM). The DNM module, inspired by the YOLO series, includes modifications to the SPPF and C3K2 of YOLOv11 to better handle the multiscale features. In particular, we employ cross-enhanced spatial attention before the DTMB and cross-channel attention after the Fusion Mamba Block to extract more discriminative features. Experimental results on the DroneVehicle dataset show that our method outperforms the baseline OAFA method by 3.6% in the mAP metric. Codes will be released at https://github.com/GreatPlum-hnu/UAVD-Mamba.git.
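The DTMB described above builds its token sequence from two kinds of patches: values sampled on a regular convolutional grid ("normal" patches) and values sampled at offset, bilinearly interpolated positions ("deformable" patches). The following pure-Python sketch illustrates that token-generation idea only; the function names, shapes, and the per-grid-point offset layout are illustrative assumptions, not the authors' released implementation.

```python
def bilinear_sample(fmap, y, x):
    """Bilinearly interpolate a 2D feature map at a fractional (y, x) position."""
    h, w = len(fmap), len(fmap[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp to the map borders
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return (fmap[y0][x0] * (1 - dy) * (1 - dx)
            + fmap[y0][x1] * (1 - dy) * dx
            + fmap[y1][x0] * dy * (1 - dx)
            + fmap[y1][x1] * dy * dx)

def deformable_tokens(fmap, stride, offsets):
    """Build a token sequence for a Mamba-style block from one feature map.

    For each regular grid point we emit two tokens: the value at the grid
    point itself (the "normal" patch) and the value at the grid point
    shifted by a learned (dy, dx) offset (the "deformable" patch).
    `offsets[i][j]` holds the offset for grid row i, grid column j.
    """
    tokens = []
    h, w = len(fmap), len(fmap[0])
    for i, y in enumerate(range(0, h, stride)):
        for j, x in enumerate(range(0, w, stride)):
            dy, dx = offsets[i][j]
            tokens.append(fmap[y][x])                             # normal token
            tokens.append(bilinear_sample(fmap, y + dy, x + dx))  # deformable token
    return tokens
```

In the full framework these offsets would be predicted by a deformable convolution and the tokens would be feature vectors rather than scalars; the sketch keeps scalars so the sampling logic stays visible.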
Problem

Research questions and friction points this paper is trying to address.

Detecting UAV-view objects despite occlusions and irregular shapes
Improving multimodal feature fusion of RGB and infrared data
Enhancing small-object detection via multiscale deformable tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deformable Token Mamba Block enhances geometric adaptability
Separate DTMBs for RGB and IR optimize feature complementarity
Stacked DTMBs and DNM improve multiscale object detection
Wei Li
College of Mechanical and Vehicle Engineering, Hunan University, Changsha 410082, China
Jiaman Tang
College of Mechanical and Vehicle Engineering, Hunan University, Changsha 410082, China
Yang Li
College of Mechanical and Vehicle Engineering, Hunan University, Changsha 410082, China
Beihao Xia
Huazhong University of Science and Technology
Ligang Tan
College of Mechanical and Vehicle Engineering, Hunan University, Changsha 410082, China
Hongmao Qin
College of Mechanical and Vehicle Engineering, Hunan University, Changsha 410082, China