Cross-modal Fuzzy Alignment Network for Text-Aerial Person Retrieval and A Large-scale Benchmark

📅 2026-03-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses visual degradation and semantic misalignment between drone-captured aerial images and eyewitness textual descriptions, both of which arise from large differences in viewpoint and altitude. The authors propose a cross-modal fuzzy alignment network that dynamically models token-level textual reliability using fuzzy membership functions. The method uses ground-view images as an intermediate bridge and introduces a context-aware dynamic alignment mechanism that adaptively combines direct matching with proxy-assisted alignment. The study also presents AERI-PEDES, a large-scale benchmark dataset whose text descriptions are produced with a chain-of-thought–based generation strategy. Experiments on the AERI-PEDES and TBAPR datasets show that the proposed model outperforms existing approaches in text-aerial person retrieval.
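The adaptive combination of direct matching and proxy-assisted alignment can be illustrated with a small sketch. The page gives no formulas, so everything here is an assumption made for illustration: the function name `combine_alignment`, the use of `min` to score the ground-view path, the softmax weighting, and the `temperature` parameter are hypothetical and are not taken from the paper.

```python
import math

def combine_alignment(direct, text_to_ground, ground_to_aerial, temperature=0.1):
    """Blend a direct text-aerial score with a ground-view (proxy) path score.

    All inputs are similarity scores, e.g. cosine similarities in [-1, 1].
    Hypothetical sketch: the weighting rule is an assumption, not the paper's.
    """
    # Assumption: the proxy path is only as strong as its weakest link.
    agent = min(text_to_ground, ground_to_aerial)

    # Softmax over the two scores, shifted by the max for numerical stability.
    top = max(direct, agent)
    w_direct = math.exp((direct - top) / temperature)
    w_agent = math.exp((agent - top) / temperature)
    total = w_direct + w_agent
    w_direct, w_agent = w_direct / total, w_agent / total

    # The fused score is a convex combination of the two paths.
    fused = w_direct * direct + w_agent * agent
    return fused, w_direct, w_agent

# Example: a weak direct match backed by a strong ground-view path.
fused, w_direct, w_agent = combine_alignment(0.2, 0.8, 0.7)
print(fused, w_direct, w_agent)
```

In this sketch the more confident path receives most of the weight, so a degraded aerial view can still be matched when the ground-view bridge agrees with the text. A learned gate conditioned on image and text context would be a natural replacement for the fixed softmax.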

📝 Abstract

Text-aerial person retrieval aims to identify targets in UAV-captured images from eyewitness descriptions, supporting intelligent transportation and public security applications. Compared with ground-view text-image person retrieval, UAV-captured images often suffer from degraded visual information due to drastic variations in viewing angle and flight altitude, which makes semantic alignment with textual descriptions difficult. To address this issue, we propose a novel Cross-modal Fuzzy Alignment Network for text-aerial person retrieval. It quantifies token-level reliability with fuzzy logic to achieve accurate fine-grained alignment, and it incorporates ground-view images as a bridge agent to further narrow the gap between aerial images and text descriptions. In particular, we design a Fuzzy Token Alignment module that employs a fuzzy membership function to dynamically model token-level association strength and suppress the influence of unobservable or noisy tokens. This alleviates the semantic inconsistencies caused by missing visual cues and significantly enhances the robustness of token-level semantic alignment. Moreover, we design a Context-Aware Dynamic Alignment module that uses the ground-view agent as a bridge in text-aerial alignment and adaptively combines direct alignment and agent-assisted alignment to improve robustness. In addition, we construct a large-scale benchmark dataset, AERI-PEDES, using a chain-of-thought process that decomposes text generation into attribute parsing, initial captioning, and refinement, thus boosting textual accuracy and semantic consistency. Experiments on AERI-PEDES and TBAPR demonstrate the superiority of our method.
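To make the idea of fuzzy token-level weighting concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the membership function, so the sigmoid form, its `center` and `slope` parameters, the max-over-patches matching, and all function names are assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

def sigmoid_membership(similarity, center=0.3, slope=10.0):
    """Assumed fuzzy membership: degree in (0, 1) to which a token is observable."""
    return 1.0 / (1.0 + math.exp(-slope * (similarity - center)))

def fuzzy_token_alignment(token_embs, patch_embs, center=0.3, slope=10.0):
    """Return (alignment score, per-token membership degrees).

    Each text token is matched to its most similar image patch. Tokens with
    weak visual support receive a low membership and contribute little.
    """
    best = [max(cosine(t, p) for p in patch_embs) for t in token_embs]
    mu = [sigmoid_membership(s, center, slope) for s in best]
    total = sum(mu)
    if total == 0.0:
        return 0.0, mu
    # Membership-weighted average of the per-token best similarities.
    score = sum(m * s for m, s in zip(mu, best)) / total
    return score, mu

# Example: the second token has almost no support in the image patches.
tokens = [[1.0, 0.0], [0.0, 1.0]]
patches = [[1.0, 0.0], [0.9, 0.1]]
score, mu = fuzzy_token_alignment(tokens, patches)
print(score, mu)
```

Compared with a plain average over tokens, the weighted score in this sketch is dominated by tokens that are actually visible in the image, which is the suppression effect the abstract describes for unobservable or noisy tokens.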

Problem

Research questions and friction points this paper is trying to address.

text-aerial person retrieval

cross-modal alignment

UAV imagery

semantic inconsistency

visual degradation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal Fuzzy Alignment

Text-Aerial Person Retrieval

Fuzzy Token Alignment