🤖 AI Summary
Existing vision-language navigation (VLN) methods for UAVs focus on coarse-grained, long-range targets and fail to meet the precision requirements of low-altitude last-mile delivery. Method: This paper formally defines and addresses the fine-grained aerial last-mile delivery VLN task. We propose a modular multimodal large model architecture that integrates a lightweight large language model (LLM) with a vision-language model (VLM), enabling joint natural language understanding, floor-level localization, fine-grained object detection, and autonomous action decision-making. We further construct the CARLA-based Vision-Language Delivery (VLD) dataset, the first benchmark tailored for aerial last-mile delivery. Contribution/Results: Extensive end-to-end evaluation and ablation studies on VLD demonstrate significant improvements in navigation accuracy and environmental robustness over baseline approaches.
📝 Abstract
The growing demand for intelligent logistics, particularly fine-grained terminal delivery, underscores the need for autonomous UAV (Unmanned Aerial Vehicle)-based delivery systems. However, most existing last-mile delivery studies rely on ground robots, while current UAV-based Vision-Language Navigation (VLN) tasks primarily focus on coarse-grained, long-range goals, making them unsuitable for precise terminal delivery. To bridge this gap, we propose LogisticsVLN, a scalable aerial delivery system built on multimodal large language models (MLLMs) for autonomous terminal delivery. LogisticsVLN integrates lightweight Large Language Models (LLMs) and Vision-Language Models (VLMs) in a modular pipeline for request understanding, floor localization, object detection, and action decision-making. To support research and evaluation in this new setting, we construct the Vision-Language Delivery (VLD) dataset within the CARLA simulator. Experimental results on the VLD dataset demonstrate the feasibility of the LogisticsVLN system. In addition, we conduct subtask-level evaluations of each module of our system, offering valuable insights for improving the robustness and real-world deployment of foundation model-based vision-language delivery systems.
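The modular pipeline the abstract describes (request understanding, floor localization, object detection, action decision) can be sketched as a chain of swappable stages. The sketch below is purely illustrative: every interface, name, and toy model here is a hypothetical stand-in, not the paper's actual implementation.

```python
# Hypothetical sketch of a modular MLLM delivery pipeline in the spirit of
# LogisticsVLN. Stage names and signatures are illustrative assumptions,
# not the paper's API.
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class Modules:
    # Each stage is an independent callable, so a lightweight LLM or VLM
    # backend can be swapped in without touching the rest of the pipeline.
    understand: Callable[[str], Dict[str, Any]]        # request text -> structured goal
    localize_floor: Callable[[Dict, Any], int]         # goal + current view -> floor estimate
    detect: Callable[[Dict, Any], bool]                # goal + current view -> target visible?
    decide: Callable[[Dict, int, bool], str]           # goal + state -> discrete action


def step(mods: Modules, request: str, view: Any) -> str:
    """Run one decision step: parse the request, localize, detect, act."""
    goal = mods.understand(request)
    floor = mods.localize_floor(goal, view)
    found = mods.detect(goal, view)
    return mods.decide(goal, floor, found)


# Toy stand-ins so the sketch runs without any real models.
mods = Modules(
    understand=lambda text: {"floor": 3, "object": "red mailbox"},
    localize_floor=lambda goal, view: view["floor"],
    detect=lambda goal, view: goal["object"] in view["objects"],
    decide=lambda goal, floor, found: (
        "descend_and_deliver" if found and floor == goal["floor"]
        else ("ascend" if floor < goal["floor"] else "search")
    ),
)

action = step(mods, "Deliver to the red mailbox on the 3rd floor",
              {"floor": 2, "objects": []})
print(action)  # -> "ascend"
```

The point of the design is that subtask-level evaluation (as the abstract mentions) falls out naturally: each stage can be tested in isolation by calling it directly.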