HCCM: Hierarchical Cross-Granularity Contrastive and Matching Learning for Natural Language-Guided Drones

📅 2025-08-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Fine-grained vision-language alignment in natural language-guided drone (NLGD) navigation remains challenging due to wide-field-of-view imagery and semantically complex, ambiguous, or incomplete textual descriptions. Method: This paper proposes a hierarchical cross-granularity contrastive and matching learning framework that abandons conventional hierarchical paradigms reliant on precise entity segmentation and strict containment relations. It introduces region-global image-text contrastive (RG-ITC) and matching (RG-ITM) mechanisms, augmented by momentum contrast and distillation (MCD) to enhance robustness against imprecise language and improve zero-shot generalization. Results: On GeoText-1652, the method achieves Recall@1 of 28.8% (image retrieval) and 14.7% (text retrieval); on the unseen ERA dataset, it attains a zero-shot mean Recall (mR) of 39.93%, significantly outperforming state-of-the-art approaches.

📝 Abstract
Natural Language-Guided Drones (NLGD) provide a novel paradigm for tasks such as target matching and navigation. However, the wide field of view and complex compositional semantics in drone scenarios pose challenges for vision-language understanding. Mainstream Vision-Language Models (VLMs) emphasize global alignment while lacking fine-grained semantics, and existing hierarchical methods depend on precise entity partitioning and strict containment, limiting effectiveness in dynamic environments. To address this, we propose the Hierarchical Cross-Granularity Contrastive and Matching learning (HCCM) framework with two components: (1) Region-Global Image-Text Contrastive Learning (RG-ITC), which avoids precise scene partitioning and captures hierarchical local-to-global semantics by contrasting local visual regions with global text and vice versa; (2) Region-Global Image-Text Matching (RG-ITM), which dispenses with rigid constraints and instead evaluates local semantic consistency within global cross-modal representations, enhancing compositional reasoning. Moreover, drone text descriptions are often incomplete or ambiguous, destabilizing alignment. HCCM introduces a Momentum Contrast and Distillation (MCD) mechanism to improve robustness. Experiments on GeoText-1652 show HCCM achieves state-of-the-art Recall@1 of 28.8% (image retrieval) and 14.7% (text retrieval). On the unseen ERA dataset, HCCM demonstrates strong zero-shot generalization with 39.93% mean recall (mR), outperforming fine-tuned baselines.
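The RG-ITC objective described above can be sketched as a symmetric InfoNCE loss between pooled region features and global text embeddings. This is a minimal numpy illustration of the general region-to-global contrastive idea, not the paper's implementation: the function names, mean-pooling of regions, and temperature value are all assumptions for the sketch.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def rg_itc_loss(region_feats, global_text, temperature=0.07):
    """Region-to-global InfoNCE sketch (illustrative, not the paper's code).

    region_feats: (B, R, D) local visual region embeddings per image
    global_text:  (B, D)    global sentence embeddings
    Each image's pooled region feature should match its own global text
    embedding against the other in-batch texts as negatives.
    """
    v = l2_normalize(region_feats.mean(axis=1))  # pool R regions -> (B, D)
    t = l2_normalize(global_text)                # (B, D)
    logits = v @ t.T / temperature               # (B, B) similarity matrix
    labels = np.arange(len(v))                   # matching pairs on the diagonal

    def xent(lg):
        # cross-entropy with the diagonal entries as positives
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # symmetric: region->text and text->region directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

In this sketch the loss is near zero when each image's pooled regions align with its own caption and grows as in-batch negatives become competitive; the paper's actual framework additionally contrasts global images against local text fragments, which the same pattern covers with the roles swapped.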
Problem

Research questions and friction points this paper is trying to address.

Addresses fine-grained vision-language understanding in wide-field-of-view drone scenarios
Overcomes the reliance of hierarchical methods on precise entity partitioning and strict containment relations
Enhances robustness against incomplete or ambiguous text descriptions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical cross-granularity contrastive learning without precise partitioning
Region-global image-text matching for semantic consistency
Momentum contrast and distillation mechanism for robustness
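The momentum contrast and distillation idea in the last bullet typically combines an exponential-moving-average (EMA) teacher with soft-target distillation, so noisy or incomplete captions do not destabilize the alignment targets. The sketch below shows that generic pattern only; the function names, momentum value, and loss form are assumptions, not HCCM's actual code.

```python
import numpy as np

def ema_update(teacher, student, m=0.995):
    """Momentum (EMA) update: drift teacher parameters toward the student.

    Illustrative only; parameters are a dict of name -> value.
    """
    return {k: m * teacher[k] + (1 - m) * student[k] for k in teacher}

def distill_loss(student_logits, teacher_logits):
    """Soft-target distillation sketch: cross-entropy between the student's
    similarity distribution and the momentum teacher's soft targets."""
    def softmax(x):
        x = x - x.max(axis=1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=1, keepdims=True)

    p_teacher = softmax(teacher_logits)          # treated as fixed targets
    log_p_student = np.log(softmax(student_logits))
    return -(p_teacher * log_p_student).sum(axis=1).mean()
```

Because the teacher moves slowly (m close to 1) and its softmax spreads probability over plausible matches instead of a single one-hot label, the student receives smoother targets than hard contrastive labels, which is the robustness mechanism the MCD bullet refers to.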
Hao Ruan
Department of Artificial Intelligence, Xiamen University, Xiamen, China
Jinliang Lin
Department of Artificial Intelligence, Xiamen University, Xiamen, China
Yingxin Lai
Department of Artificial Intelligence, Xiamen University, Xiamen, China
Zhiming Luo
Xiamen University
Computer Vision, Deep Learning, Machine Learning
Shaozi Li
Professor, Department of Intelligence Science and Technology, Xiamen University
Artificial Intelligence, Computer Vision, Machine Learning