🤖 AI Summary
Cross-modal drone navigation faces significant challenges in achieving fine-grained semantic matching between natural language queries and aerial visual scenes. To address this, we propose a two-stage retrieval refinement framework. In the first stage, a vision-language model (VLM) generates descriptive textual captions for candidate aerial images, mitigating viewpoint and modality discrepancies. In the second stage, we introduce a multimodal semantic re-ranking module that jointly embeds text and images to enable deep semantic alignment. This approach refines coarse retrieval results through generative, VLM-driven re-ranking, substantially improving cross-view image retrieval accuracy. Evaluated on the RoboSense 2025 Challenge, our method achieves second place (TOP-2), with consistent improvements of +5% in Recall@1, Recall@5, and Recall@10. These results demonstrate the effectiveness and generalizability of VLM-based generative re-ranking for complex aerial navigation tasks.
📝 Abstract
Cross-modal drone navigation remains a challenging task in robotics, requiring efficient retrieval of relevant images from large-scale databases based on natural language descriptions. The RoboSense 2025 Track 4 challenge targets this problem, focusing on robust, natural-language-guided cross-view image retrieval across multiple platforms (drones, satellites, and ground cameras). Current baseline methods, while effective for initial retrieval, often struggle to achieve fine-grained semantic matching between text queries and visual content, especially in complex aerial scenes. To address this, we propose a two-stage retrieval refinement method, the Caption-Guided Retrieval System (CGRS), which enhances the baseline's coarse ranking through intelligent re-ranking. Our method first leverages a baseline model to obtain an initial coarse ranking of the top 20 most relevant images for each query. We then use a Vision-Language Model (VLM) to generate detailed captions for these candidate images, capturing rich semantic descriptions of their visual content. The generated captions are fed into a multimodal similarity computation framework, which performs fine-grained re-ranking against the original text query and effectively builds a semantic bridge between visual content and natural language descriptions. Our approach significantly improves upon the baseline, achieving a consistent 5% improvement across all key metrics (Recall@1, Recall@5, and Recall@10), and won second place (TOP-2) in the challenge, demonstrating the practical value of our semantic refinement strategy in real-world robotic navigation scenarios.
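The caption-guided re-ranking stage described above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the function names (`caption_guided_rerank`, `cosine_sim`) are hypothetical, the VLM captions are assumed to be precomputed and passed in as a dictionary, and a bag-of-words cosine similarity stands in for the multimodal similarity framework used in the actual system.

```python
import math
from collections import Counter

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words token counts.
    A toy stand-in for the learned text-similarity model."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def caption_guided_rerank(query: str,
                          coarse_ranking: list[str],
                          captions: dict[str, str],
                          top_k: int = 20) -> list[str]:
    """Re-rank the top-k coarse candidates by similarity between the
    text query and each candidate's VLM-generated caption (assumed
    precomputed here)."""
    pool = coarse_ranking[:top_k]
    return sorted(pool,
                  key=lambda img: cosine_sim(query, captions[img]),
                  reverse=True)
```

For example, given a coarse ranking where the best semantic match sits below rank 1, caption similarity can promote it:

```python
captions = {
    "img_a": "a river winding through a dense forest",
    "img_b": "a red building beside a parking lot and an access road",
}
ranked = caption_guided_rerank(
    "red building next to a parking lot", ["img_a", "img_b"], captions)
# "img_b" is promoted above "img_a"
```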