MSAM: Multi-Semantic Adaptive Mining for Cross-Modal Drone Video-Text Retrieval

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Cross-modal retrieval of drone videos faces challenges due to top-down viewpoints, structural homogeneity, and semantically diverse object compositions. This paper formally defines the drone video–text retrieval task for the first time and proposes a multi-semantic adaptive mining framework. The method comprises four core components: adaptive semantic generation via fine-grained vision–language interaction; distribution-driven semantic learning through object-region feature fusion; diversity-aware semantic constraints; and cross-modal interactive pooling for dynamic inter-frame variation modeling. These modules jointly enhance semantic depth, feature robustness, and noise resilience. Extensive experiments on two newly constructed drone video–text benchmark datasets demonstrate that our approach significantly outperforms existing cross-modal retrieval models, achieving a 12.6% improvement in mean Average Precision (mAP). The framework exhibits strong practicality and scalability.
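The cross-modal interactive pooling component can be pictured as text-guided attention over frame features, so that frames irrelevant to the query contribute little to the final video representation. Below is a minimal sketch under that assumption; the module name CrossModalPooling, the linear projections, and the temperature value are illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalPooling(nn.Module):
    """Text-guided pooling over frame features (hypothetical sketch)."""

    def __init__(self, dim: int, temperature: float = 0.07):
        super().__init__()
        self.query = nn.Linear(dim, dim)  # projects the sentence embedding into a query
        self.key = nn.Linear(dim, dim)    # projects each frame feature into a key
        self.temperature = temperature

    def forward(self, frame_feats: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, D) per-frame video features, text_feat: (B, D) sentence embedding
        q = self.query(text_feat).unsqueeze(1)             # (B, 1, D)
        k = self.key(frame_feats)                          # (B, T, D)
        scores = (q * k).sum(dim=-1) / self.temperature    # (B, T) text-frame relevance
        weights = F.softmax(scores, dim=-1).unsqueeze(-1)  # (B, T, 1)
        return (weights * frame_feats).sum(dim=1)          # (B, D) text-conditioned video feature
```

Down-weighting frames by their relevance to the query text is one simple way to realize the stated goal of suppressing complex-background noise before matching.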

📝 Abstract
With the advancement of drone technology, the volume of drone video data is growing rapidly, creating an urgent need for efficient semantic retrieval. We are the first to systematically propose and study the drone video-text retrieval (DVTR) task. Drone videos feature overhead perspectives, strong structural homogeneity, and diverse semantic expressions of target combinations, which make it difficult for existing cross-modal methods designed for ground-level views to model their characteristics effectively. Dedicated retrieval mechanisms tailored to drone scenarios are therefore necessary. To address this issue, we propose a novel approach called Multi-Semantic Adaptive Mining (MSAM). MSAM introduces a multi-semantic adaptive learning mechanism that incorporates dynamic changes between frames and extracts rich semantic information from specific scene regions, thereby enhancing deep understanding of and reasoning over drone video content. The method relies on fine-grained interactions between words and drone video frames, integrating an adaptive semantic construction module, a distribution-driven semantic learning term, and a diversity semantic term to deepen the interaction between the text and drone video modalities and improve the robustness of feature representations. To reduce the interference of complex backgrounds in drone videos, we introduce a cross-modal interactive feature fusion pooling mechanism that focuses feature extraction and matching on target regions, minimizing noise effects. Extensive experiments on two self-constructed drone video-text datasets show that MSAM outperforms existing methods on the drone video-text retrieval task. The source code and dataset will be made publicly available.
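The fine-grained interaction between words and frames described in the abstract resembles a late-interaction similarity: each word is matched against every frame, keeps its best match, and the per-word scores are aggregated into a retrieval score. The sketch below follows that reading; the function name and the max-mean aggregation are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def word_frame_similarity(word_feats: torch.Tensor, frame_feats: torch.Tensor) -> torch.Tensor:
    """Late-interaction score between one caption and one video (hypothetical sketch).

    word_feats:  (N, D) word embeddings of the caption
    frame_feats: (T, D) frame embeddings of the drone video
    """
    w = F.normalize(word_feats, dim=-1)
    f = F.normalize(frame_feats, dim=-1)
    sim = w @ f.t()                    # (N, T) cosine similarity of every word-frame pair
    per_word = sim.max(dim=1).values   # each word keeps its best-matching frame
    return per_word.mean()             # average over words -> scalar retrieval score
```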
Problem

Research questions and friction points this paper is trying to address.

Addressing drone video-text retrieval challenges from aerial perspectives
Overcoming structural homogeneity and diverse semantic combinations in drone footage
Enhancing cross-modal interaction robustness against complex background interference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-semantic adaptive learning for drone video understanding (a diversity-constraint sketch follows this list)
Cross-modal interactive feature fusion pooling mechanism
Fine-grained word-frame interactions with adaptive semantic construction
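For the multi-semantic and diversity-aware side, a common way to keep several mined semantic embeddings from collapsing onto the same content is an orthogonality-style penalty on their pairwise similarities. The sketch below illustrates such a diversity constraint under that assumption; the function name and the Gram-matrix formulation are hypothetical and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def semantic_diversity_loss(semantic_feats: torch.Tensor) -> torch.Tensor:
    """Diversity penalty over K semantic embeddings per video (hypothetical sketch).

    semantic_feats: (B, K, D) adaptively mined semantic vectors.
    Pushes the pairwise Gram matrix toward the identity so the K semantics
    do not collapse onto the same content.
    """
    s = F.normalize(semantic_feats, dim=-1)
    gram = s @ s.transpose(1, 2)                                 # (B, K, K) pairwise similarities
    eye = torch.eye(s.size(1), device=s.device).expand_as(gram)  # target: orthonormal semantics
    return ((gram - eye) ** 2).mean()
```

Minimizing such a term alongside the retrieval objective encourages the K semantic vectors to cover different aspects of a scene.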
Jinghao Huang
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China
Yaxiong Chen
Wuhan University of Technology
deep hashing, deep learning
Ganchao Liu
School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi’an 710072, China