Team Xiaomi EV-AD VLA: Caption-Guided Retrieval System for Cross-Modal Drone Navigation - Technical Report for IROS 2025 RoboSense Challenge Track 4

📅 2025-10-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Cross-modal drone navigation faces significant challenges in achieving fine-grained semantic matching between natural language queries and aerial visual scenes. To address this, we propose a two-stage retrieval refinement framework. In the first stage, a vision-language model (VLM) generates descriptive textual captions for candidate aerial images, mitigating viewpoint and modality discrepancies. In the second stage, we introduce a multimodal semantic re-ranking module that jointly embeds text and images to enable deep semantic alignment. This approach refines coarse retrieval results through generative, VLM-driven re-ranking, substantially improving cross-view image retrieval accuracy. In the RoboSense 2025 Challenge, our method achieved second place (TOP-2), with consistent improvements of +5% in Recall@1, Recall@5, and Recall@10. These results demonstrate the effectiveness and generalizability of VLM-based generative re-ranking for complex aerial navigation tasks.

📝 Abstract
Cross-modal drone navigation remains a challenging task in robotics, requiring efficient retrieval of relevant images from large-scale databases based on natural language descriptions. The RoboSense 2025 Track 4 challenge targets this problem, focusing on robust, natural language-guided cross-view image retrieval across multiple platforms (drones, satellites, and ground cameras). Current baseline methods, while effective for initial retrieval, often struggle to achieve fine-grained semantic matching between text queries and visual content, especially in complex aerial scenes. To address this, we propose a two-stage retrieval refinement method, the Caption-Guided Retrieval System (CGRS), which enhances the baseline's coarse ranking through intelligent reranking. Our method first leverages a baseline model to obtain an initial coarse ranking of the top 20 most relevant images for each query. We then use a Vision-Language Model (VLM) to generate detailed captions for these candidate images, capturing rich semantic descriptions of their visual content. The generated captions are fed into a multimodal similarity computation framework that reranks the candidates against the original text query, effectively building a semantic bridge between visual content and natural language descriptions. Our approach significantly improves upon the baseline, achieving a consistent 5% improvement across all key metrics (Recall@1, Recall@5, and Recall@10), and won second place (TOP-2) in the challenge, demonstrating the practical value of our semantic refinement strategy in real-world robotic navigation scenarios.
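As a concrete illustration of the pipeline the abstract describes, the sketch below wires the two stages together in Python: a coarse retriever supplies the top-20 candidates, a captioning VLM describes each one, and the candidates are re-sorted by caption-query similarity. The paper does not name the specific models involved, so coarse_retrieve, generate_caption, and the MiniLM text encoder are stand-ins rather than the authors' implementation.

import numpy as np
from sentence_transformers import SentenceTransformer

# Stand-in text embedder; the paper does not specify which encoder it uses.
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def coarse_retrieve(query: str, top_k: int = 20) -> list[str]:
    """Placeholder for the challenge baseline returning top-K image paths."""
    raise NotImplementedError("plug in the baseline retrieval model here")

def generate_caption(image_path: str) -> str:
    """Placeholder for the captioning VLM (unspecified in the paper)."""
    raise NotImplementedError("plug in a captioning VLM here")

def cgrs_rerank(query: str, top_k: int = 20) -> list[str]:
    """Stage 1: coarse ranking; Stage 2: caption-guided reranking."""
    candidates = coarse_retrieve(query, top_k)
    captions = [generate_caption(path) for path in candidates]
    # Embed the query and every caption, then score by cosine similarity.
    embeddings = text_encoder.encode([query] + captions)
    q, c = embeddings[0], embeddings[1:]
    scores = c @ q / (np.linalg.norm(c, axis=1) * np.linalg.norm(q))
    # Return candidates sorted by descending caption-query similarity.
    return [candidates[i] for i in np.argsort(-scores)]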
Problem

Research questions and friction points this paper is trying to address.

Enhancing cross-modal drone navigation via semantic retrieval
Improving fine-grained text-image matching in aerial scenes
Refining coarse image rankings using caption-guided reranking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage retrieval refinement method enhances baseline ranking
Vision-Language Model generates captions for candidate images
Multimodal similarity computation enables fine-grained semantic reranking (see the sketch after this list)
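One plausible reading of this multimodal similarity computation is a weighted blend of direct image-query similarity and caption-query similarity in a shared CLIP embedding space. The sketch below assumes that fusion; the CLIP checkpoint and the weight alpha are illustrative choices, not values reported by the authors.

from PIL import Image
from sentence_transformers import SentenceTransformer, util

# A CLIP-style model embeds images and text into one space (stand-in choice).
clip = SentenceTransformer("clip-ViT-B-32")

def fused_similarity(query: str, image_path: str, caption: str,
                     alpha: float = 0.5) -> float:
    """Blend caption-query and image-query cosine similarity.
    alpha is a hypothetical fusion weight, not reported in the paper."""
    q = clip.encode(query, convert_to_tensor=True)
    img = clip.encode(Image.open(image_path), convert_to_tensor=True)
    cap = clip.encode(caption, convert_to_tensor=True)
    return (alpha * util.cos_sim(q, cap).item()
            + (1 - alpha) * util.cos_sim(q, img).item())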
👥 Authors
Lingfeng Zhang
PhD student at Tsinghua University
Embodied AI
Erjia Xiao
The Hong Kong University of Science and Technology
Machine Learning
Yuchen Zhang
Georgia Institute of Technology, Xiaomi EV
Haoxiang Fu
National University of Singapore
Ruibin Hu
The Chinese University of Hong Kong
Yanbiao Ma
Renmin University of China
Wenbo Ding
University at Buffalo
Security, Machine Learning
Long Chen
Xiaomi EV
Hangjun Ye
Xiaomi EV
Xiaoshuai Hao
Beijing Academy of Artificial Intelligence (BAAI)
Vision and Language