🤖 AI Summary
To address the hardware constraints of onboard target recognition and the poor generalization of closed-set models in UAV-based search and rescue, this paper proposes a cloud-semantics-driven, lightweight open-vocabulary tracking paradigm. The method combines vision-language multimodal embeddings with semantic alignment distillation to build a lightweight temporal modeling module, and introduces an edge-cloud collaborative inference architecture. For the first time, it achieves zero-shot, natural-language-guided (e.g., “person wearing a red shirt”) persistent tracking of dynamic targets without task-specific training. Evaluated on a real-world UAV video dataset, the approach attains an 86.3% tracking success rate with sub-45 ms inference latency on the NVIDIA Jetson AGX platform, significantly outperforming existing methods, while remaining robust to substantial appearance, pose, and illumination variations.
📝 Abstract
Unmanned aerial vehicles (UAVs) are now commonly used in search and rescue scenarios to gather information over the search area. Automatically identifying the person being searched for in aerial footage could increase the autonomy of such systems, reduce the search time, and thereby improve the missing person's chances of survival. In this paper, we present a novel approach to semantically conditioned open-vocabulary object tracking that is specifically designed to cope with the limitations of UAV hardware. Our approach has several advantages: it can operate from a verbal description of the missing person (e.g., the color of their shirt), it requires no dedicated training to execute a mission, and it can efficiently track a potentially moving person. Our experimental results demonstrate the versatility and efficacy of the approach.
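The abstract does not give implementation details, but the core idea of language-guided target selection is commonly realized by scoring candidate detections against a text query in a shared vision-language embedding space (CLIP-style) and keeping the best match above a threshold. The sketch below illustrates that selection step only; the embeddings, threshold, and function names are illustrative placeholders, not the paper's actual pipeline.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_target(text_emb, region_embs, threshold=0.5):
    """Pick the candidate region whose (hypothetical) vision-language
    embedding best matches the text-query embedding; return None when
    no candidate clears the similarity threshold."""
    sims = [cosine_sim(text_emb, r) for r in region_embs]
    best = int(np.argmax(sims))
    return (best, sims[best]) if sims[best] >= threshold else None

# Toy 3-d vectors standing in for CLIP-style embeddings (placeholders).
query = np.array([1.0, 0.0, 0.0])               # e.g. "person in red shirt"
regions = [np.array([0.2, 0.9, 0.1]),           # candidate 0
           np.array([0.95, 0.1, 0.0])]          # candidate 1
print(select_target(query, regions))
```

In a full tracker, the selected candidate would seed the temporal module, with re-matching against the query to recover the target after occlusions or appearance changes.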