HCCM: Hierarchical Cross-Granularity Contrastive and Matching Learning for Natural Language-Guided Drones

📅 2025-08-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Fine-grained vision-language alignment in natural language-guided drone (NLGD) navigation remains challenging due to wide-field-of-view imagery and semantically complex, ambiguous, or incomplete textual descriptions. Method: This paper proposes a hierarchical cross-granularity contrastive and matching learning framework that abandons conventional hierarchical paradigms reliant on precise entity segmentation and strict containment relations. It introduces region-global image-text contrastive (RG-ITC) and matching (RG-ITM) mechanisms, augmented by momentum contrast and distillation (MCD) to enhance robustness against imprecise language and improve zero-shot generalization. Results: On GeoText-1652, the method achieves Recall@1 of 28.8% (image retrieval) and 14.7% (text retrieval); on the unseen ERA dataset, it attains a zero-shot mean Recall (mR) of 39.93%, significantly outperforming state-of-the-art approaches.

📝 Abstract
Natural Language-Guided Drones (NLGD) provide a novel paradigm for tasks such as target matching and navigation. However, the wide field of view and complex compositional semantics in drone scenarios pose challenges for vision-language understanding. Mainstream Vision-Language Models (VLMs) emphasize global alignment while lacking fine-grained semantics, and existing hierarchical methods depend on precise entity partitioning and strict containment, limiting effectiveness in dynamic environments. To address this, we propose the Hierarchical Cross-Granularity Contrastive and Matching learning (HCCM) framework with two components: (1) Region-Global Image-Text Contrastive Learning (RG-ITC), which avoids precise scene partitioning and captures hierarchical local-to-global semantics by contrasting local visual regions with global text and vice versa; (2) Region-Global Image-Text Matching (RG-ITM), which dispenses with rigid constraints and instead evaluates local semantic consistency within global cross-modal representations, enhancing compositional reasoning. Moreover, drone text descriptions are often incomplete or ambiguous, destabilizing alignment. HCCM introduces a Momentum Contrast and Distillation (MCD) mechanism to improve robustness. Experiments on GeoText-1652 show HCCM achieves state-of-the-art Recall@1 of 28.8% (image retrieval) and 14.7% (text retrieval). On the unseen ERA dataset, HCCM demonstrates strong zero-shot generalization with 39.93% mean recall (mR), outperforming fine-tuned baselines.
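The RG-ITC objective described above can be sketched as a symmetric InfoNCE loss between pooled region features and global text embeddings. This is a minimal numpy illustration of the general region-to-global contrastive idea, not the paper's implementation: the function names, mean-pooling of regions, and temperature value are all assumptions for the sketch.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def rg_itc_loss(region_feats, global_text, temperature=0.07):
    """Region-to-global InfoNCE sketch (illustrative, not the paper's code).

    region_feats: (B, R, D) local visual region embeddings per image
    global_text:  (B, D)    global sentence embeddings
    Each image's pooled region feature should match its own global text
    embedding against the other in-batch texts as negatives.
    """
    v = l2_normalize(region_feats.mean(axis=1))  # pool R regions -> (B, D)
    t = l2_normalize(global_text)                # (B, D)
    logits = v @ t.T / temperature               # (B, B) similarity matrix
    labels = np.arange(len(v))                   # matching pairs on the diagonal

    def xent(lg):
        # cross-entropy with the diagonal entries as positives
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # symmetric: region->text and text->region directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

In this sketch the loss is near zero when each image's pooled regions align with its own caption and grows as in-batch negatives become competitive; the paper's actual framework additionally contrasts global images against local text fragments, which the same pattern covers with the roles swapped.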
Problem

Research questions and friction points this paper is trying to address.

Addresses fine-grained vision-language understanding in wide-field-of-view drone scenarios
Overcomes the reliance of hierarchical methods on precise entity partitioning and strict containment relations
Enhances robustness against incomplete or ambiguous text descriptions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical cross-granularity contrastive learning without precise partitioning
Region-global image-text matching for semantic consistency
Momentum contrast and distillation mechanism for robustness
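The momentum contrast and distillation idea in the last bullet typically combines an exponential-moving-average (EMA) teacher with soft-target distillation, so noisy or incomplete captions do not destabilize the alignment targets. The sketch below shows that generic pattern only; the function names, momentum value, and loss form are assumptions, not HCCM's actual code.

```python
import numpy as np

def ema_update(teacher, student, m=0.995):
    """Momentum (EMA) update: drift teacher parameters toward the student.

    Illustrative only; parameters are a dict of name -> value.
    """
    return {k: m * teacher[k] + (1 - m) * student[k] for k in teacher}

def distill_loss(student_logits, teacher_logits):
    """Soft-target distillation sketch: cross-entropy between the student's
    similarity distribution and the momentum teacher's soft targets."""
    def softmax(x):
        x = x - x.max(axis=1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=1, keepdims=True)

    p_teacher = softmax(teacher_logits)          # treated as fixed targets
    log_p_student = np.log(softmax(student_logits))
    return -(p_teacher * log_p_student).sum(axis=1).mean()
```

Because the teacher moves slowly (m close to 1) and its softmax spreads probability over plausible matches instead of a single one-hot label, the student receives smoother targets than hard contrastive labels, which is the robustness mechanism the MCD bullet refers to.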
Hao Ruan
Department of Artificial Intelligence, Xiamen University, Xiamen, China
Jinliang Lin
Department of Artificial Intelligence, Xiamen University, Xiamen, China
Yingxin Lai
Department of Artificial Intelligence, Xiamen University, Xiamen, China
Zhiming Luo
Xiamen University
Computer Vision, Deep Learning, Machine Learning
Shaozi Li
Professor, Department of Intelligence Science and Technology, Xiamen University
Artificial Intelligence, Computer Vision, Machine Learning