Cross-modal Fuzzy Alignment Network for Text-Aerial Person Retrieval and A Large-scale Benchmark

📅 2026-03-21
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenges of visual degradation and semantic misalignment between drone-captured aerial images and eyewitness textual descriptions, which arise from significant differences in viewpoint and altitude. To tackle this, the authors propose a cross-modal fuzzy alignment network that dynamically models token-level textual reliability using fuzzy membership functions. The method leverages ground-view images as an intermediate bridge and introduces a context-aware dynamic alignment mechanism that jointly exploits direct matching and proxy-assisted alignment. Additionally, the study presents AERI-PEDES, a high-quality, large-scale benchmark dataset, and incorporates a chain-of-thought-based text generation strategy. Extensive experiments demonstrate that the proposed model significantly outperforms existing approaches on both the AERI-PEDES and TBAPR datasets, exhibiting strong robustness and effectiveness in text-to-aerial pedestrian retrieval tasks.
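
The joint use of direct matching and proxy-assisted alignment can be illustrated with a small sketch. It is not the authors' implementation: the summary does not specify how the two scores are combined, so the sigmoid gate, the per-candidate `gate_logit` input, and the function names `combine_alignment` and `rank_gallery` are all assumptions made for illustration. The sketch blends a direct text-to-aerial score with a score obtained through a ground-view proxy, then ranks gallery candidates by the blended score.

```python
import math


def gate(gate_logit):
    """Map an unbounded logit to a mixing weight in (0, 1). Assumed form."""
    return 1.0 / (1.0 + math.exp(-gate_logit))


def combine_alignment(direct_score, proxy_score, gate_logit):
    """Blend direct and proxy-assisted scores with a context-dependent gate.

    A high gate_logit trusts the direct text-to-aerial score; a low one
    trusts the score routed through the ground-view proxy. In a trained
    model the logit would come from a learned module; here it is an input.
    """
    g = gate(gate_logit)
    return g * direct_score + (1.0 - g) * proxy_score


def rank_gallery(direct_scores, proxy_scores, gate_logits):
    """Return gallery indices sorted from best to worst blended score."""
    fused = [
        combine_alignment(d, p, z)
        for d, p, z in zip(direct_scores, proxy_scores, gate_logits)
    ]
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)


if __name__ == "__main__":
    # Three hypothetical gallery candidates for one text query.
    order = rank_gallery(
        direct_scores=[0.9, 0.2, 0.4],
        proxy_scores=[0.1, 0.8, 0.4],
        gate_logits=[2.0, -2.0, 0.0],
    )
    print(order)
```

With a gate logit of zero the two scores are averaged, so the gate only changes the ranking when the context favours one path over the other.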

📝 Abstract
Text-aerial person retrieval aims to identify targets in UAV-captured images from eyewitness descriptions, supporting intelligent transportation and public security applications. Compared to ground-view text-image person retrieval, UAV-captured images often suffer from degraded visual information due to drastic variations in viewing angles and flight altitudes, making semantic alignment with textual descriptions very challenging. To address this issue, we propose a novel Cross-modal Fuzzy Alignment Network, which quantifies the token-level reliability by fuzzy logic to achieve accurate fine-grained alignment and incorporates ground-view images as a bridge agent to further mitigate the gap between aerial images and text descriptions, for text-aerial person retrieval. In particular, we design the Fuzzy Token Alignment module that employs the fuzzy membership function to dynamically model token-level association strength and suppress the influence of unobservable or noisy tokens. It can alleviate the semantic inconsistencies caused by missing visual cues and significantly enhance the robustness of token-level semantic alignment. Moreover, to further mitigate the gap between aerial images and text descriptions, we design a Context-Aware Dynamic Alignment module to incorporate the ground-view agent as a bridge in text-aerial alignment and adaptively combine direct alignment and agent-assisted alignment to improve the robustness. In addition, we construct a large-scale benchmark dataset called AERI-PEDES by using a chain-of-thought to decompose text generation into attribute parsing, initial captioning, and refinement, thus boosting textual accuracy and semantic consistency. Experiments on AERI-PEDES and TBAPR demonstrate the superiority of our method.
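
As a rough illustration of how a fuzzy membership function can suppress unobservable or noisy tokens, the sketch below weights each token's similarity to the image by a membership degree before aggregating. The abstract does not give the membership function used in the paper, so the sigmoidal form, the `center` and `slope` parameters, and the function names are assumptions chosen for illustration, with token scores assumed to lie in [-1, 1].

```python
import math


def sigmoid_membership(score, center=0.3, slope=10.0):
    """Sigmoidal fuzzy membership: degree in (0, 1) that a token is reliable.

    `center` is the score at which membership equals 0.5 and `slope`
    controls how sharp the transition is. Both values are illustrative.
    """
    return 1.0 / (1.0 + math.exp(-slope * (score - center)))


def fuzzy_weighted_similarity(token_scores, center=0.3, slope=10.0):
    """Aggregate token-image similarities, down-weighting unreliable tokens."""
    if not token_scores:
        return 0.0
    memberships = [sigmoid_membership(s, center, slope) for s in token_scores]
    weighted_sum = sum(m * s for m, s in zip(memberships, token_scores))
    return weighted_sum / sum(memberships)


if __name__ == "__main__":
    # Two tokens match the aerial image well; the third describes a detail
    # that is not visible from above, so its similarity is low.
    scores = [0.8, 0.7, -0.2]
    plain_mean = sum(scores) / len(scores)
    print(plain_mean, fuzzy_weighted_similarity(scores))
```

In this toy case the fuzzy-weighted score stays close to the two well-matched tokens, whereas the plain mean is pulled down by the token with no visual support.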
Problem

Research questions and friction points this paper is trying to address.

text-aerial person retrieval
cross-modal alignment
UAV imagery
semantic inconsistency
visual degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal Fuzzy Alignment
Text-Aerial Person Retrieval
Fuzzy Token Alignment
Ground-view Bridge Agent
AERI-PEDES Benchmark
Yifei Deng
State Key Laboratory of Opto-Electronic Information Acquisition and Protection Technology; School of Computer Science and Technology, Anhui University
Chenglong Li
Professor, The University of Florida
Drug Design, Drug Discovery, Molecular Recognition, Molecular Modeling, Protein structure and Dynamics
Yuyang Zhang
Graduate Student, Harvard University
Reinforcement Learning, Control Theory
Guyue Hu
School of Artificial Intelligence, Anhui University
Jin Tang
Anhui University
Computer vision, intelligent video analysis