UAVD-Mamba: Deformable Token Fusion Vision Mamba for Multimodal UAV Detection

📅 2025-07-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address severe occlusion, dense small objects, and irregular target shapes in UAV-based object detection, this paper proposes UAVD-Mamba, a multimodal detection framework built on the Mamba architecture. It introduces a Deformable Token Mamba Block (DTMB) that combines adaptive patches from deformable convolutions with normal convolutional patches to improve geometric adaptability; uses separate DTMBs for the RGB and infrared modalities, whose outputs feed a Fusion Mamba Block to exploit modality complementarity; applies cross-enhanced spatial attention before the DTMB and cross-channel attention after fusion to sharpen feature discrimination; and designs a YOLO-inspired Detection Neck for Mamba (DNM) to handle multiscale features. Evaluated on the DroneVehicle dataset, UAVD-Mamba outperforms the OAFA baseline by 3.6% mAP, with substantial gains in small-object detection. The results demonstrate the framework's robustness and the effectiveness of its multimodal fusion strategy.

📝 Abstract
Unmanned Aerial Vehicle (UAV) object detection has been widely used in traffic management, agriculture, emergency rescue, etc. However, it faces significant challenges, including occlusions, small object sizes, and irregular shapes. These challenges highlight the necessity for a robust and efficient multimodal UAV object detection method. Mamba has demonstrated considerable potential in multimodal image fusion. Leveraging this, we propose UAVD-Mamba, a multimodal UAV object detection framework based on Mamba architectures. To improve geometric adaptability, we propose the Deformable Token Mamba Block (DTMB) to generate deformable tokens by incorporating adaptive patches from deformable convolutions alongside normal patches from normal convolutions, which serve as the inputs to the Mamba Block. To optimize the multimodal feature complementarity, we design two separate DTMBs for the RGB and infrared (IR) modalities, with the outputs from both DTMBs integrated into the Mamba Block for feature extraction and into the Fusion Mamba Block for feature fusion. Additionally, to improve multiscale object detection, especially for small objects, we stack four DTMBs at different scales to produce multiscale feature representations, which are then sent to the Detection Neck for Mamba (DNM). The DNM module, inspired by the YOLO series, includes modifications to the SPPF and C3K2 of YOLOv11 to better handle the multiscale features. In particular, we employ cross-enhanced spatial attention before the DTMB and cross-channel attention after the Fusion Mamba Block to extract more discriminative features. Experimental results on the DroneVehicle dataset show that our method outperforms the baseline OAFA method by 3.6% in the mAP metric. Codes will be released at https://github.com/GreatPlum-hnu/UAVD-Mamba.git.
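The DTMB described above builds its token sequence from two kinds of patches: values sampled on a regular convolutional grid ("normal" patches) and values sampled at offset, bilinearly interpolated positions ("deformable" patches). The following pure-Python sketch illustrates that token-generation idea only; the function names, shapes, and the per-grid-point offset layout are illustrative assumptions, not the authors' released implementation.

```python
def bilinear_sample(fmap, y, x):
    """Bilinearly interpolate a 2D feature map at a fractional (y, x) position."""
    h, w = len(fmap), len(fmap[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp to the map borders
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return (fmap[y0][x0] * (1 - dy) * (1 - dx)
            + fmap[y0][x1] * (1 - dy) * dx
            + fmap[y1][x0] * dy * (1 - dx)
            + fmap[y1][x1] * dy * dx)

def deformable_tokens(fmap, stride, offsets):
    """Build a token sequence for a Mamba-style block from one feature map.

    For each regular grid point we emit two tokens: the value at the grid
    point itself (the "normal" patch) and the value at the grid point
    shifted by a learned (dy, dx) offset (the "deformable" patch).
    `offsets[i][j]` holds the offset for grid row i, grid column j.
    """
    tokens = []
    h, w = len(fmap), len(fmap[0])
    for i, y in enumerate(range(0, h, stride)):
        for j, x in enumerate(range(0, w, stride)):
            dy, dx = offsets[i][j]
            tokens.append(fmap[y][x])                             # normal token
            tokens.append(bilinear_sample(fmap, y + dy, x + dx))  # deformable token
    return tokens
```

In the full framework these offsets would be predicted by a deformable convolution and the tokens would be feature vectors rather than scalars; the sketch keeps scalars so the sampling logic stays visible.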
Problem

Research questions and friction points this paper is trying to address.

Detecting UAV-view objects despite occlusions and irregular shapes
Improving multimodal feature fusion of RGB and infrared data
Enhancing small-object detection via multiscale deformable tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deformable Token Mamba Block enhances geometric adaptability
Separate DTMBs for RGB and IR optimize feature complementarity
Stacked DTMBs and DNM improve multiscale object detection
Wei Li
College of Mechanical and Vehicle Engineering, Hunan University, Changsha 410082, China
Jiaman Tang
College of Mechanical and Vehicle Engineering, Hunan University, Changsha 410082, China
Yang Li
College of Mechanical and Vehicle Engineering, Hunan University, Changsha 410082, China
Beihao Xia
Huazhong University of Science and Technology
Ligang Tan
College of Mechanical and Vehicle Engineering, Hunan University, Changsha 410082, China
Hongmao Qin
College of Mechanical and Vehicle Engineering, Hunan University, Changsha 410082, China