See&Say: Vision Language Guided Safe Zone Detection for Autonomous Package Delivery Drones

📅 2026-04-14
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the challenge of reliably identifying safe package drop zones for autonomous delivery drones in cluttered urban and suburban environments. It proposes See&Say, a framework that fuses monocular depth gradients with open-vocabulary detection masks to produce safety maps, guided by a vision-language model (VLM) that iteratively refines the result by adjusting object category prompts over time. When the primary drop pad is occupied or unsafe, the system identifies alternative candidate zones. Evaluated on a curated dataset of urban delivery scenarios with moving objects and human activity, the method outperforms all baselines, achieving the highest accuracy and Intersection over Union (IoU) for safety map prediction and the best results in alternative drop zone evaluation across multiple thresholds.
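
The fusion of depth gradients with detection masks can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `safety_map`, the threshold `grad_thresh`, and the rule "safe means flat and not covered by a hazard mask" are assumptions chosen for illustration, and the inputs stand in for the outputs of a monocular depth estimator and an open-vocabulary detector.

```python
import numpy as np

def safety_map(depth, hazard_mask, grad_thresh=0.05):
    """Return a boolean map that is True where a pixel is treated as safe.

    depth        -- 2D array of per-pixel depth (hypothetical estimator output)
    hazard_mask  -- 2D array, nonzero where a hazard was detected (hypothetical)
    grad_thresh  -- assumed cutoff on depth-gradient magnitude
    """
    # Depth gradients along rows and columns; a large magnitude suggests
    # an edge, slope, or obstacle rather than flat ground.
    gy, gx = np.gradient(np.asarray(depth, dtype=float))
    grad_mag = np.hypot(gx, gy)
    flat = grad_mag < grad_thresh
    # A pixel counts as safe only if it is flat and not covered by a hazard.
    return flat & ~np.asarray(hazard_mask, dtype=bool)
```

In a VLM-guided loop of the kind the summary describes, the hazard mask would be regenerated whenever the category prompts change, and the map recomputed from the new mask.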

📝 Abstract
Autonomous drone delivery systems are rapidly advancing, but ensuring safe and reliable package drop-offs remains highly challenging in cluttered urban and suburban environments where accurately identifying suitable package drop zones is critical. Existing approaches typically rely on either geometry-based analysis or semantic segmentation alone, but these methods lack the integrated semantic reasoning required for robust decision-making. To address this gap, we propose See&Say, a novel framework that combines geometric safety cues with semantic perception, guided by a Vision-Language Model (VLM) for iterative refinement. The system fuses monocular depth gradients with open-vocabulary detection masks to produce safety maps, while the VLM dynamically adjusts object category prompts and refines hazard detection across time, enabling reliable reasoning under dynamic conditions during the final delivery phase. When the primary drop-pad is occupied or unsafe, the proposed See&Say also identifies alternative candidate zones for package delivery. We curated a dataset of urban delivery scenarios with moving objects and human activities to evaluate the approach. Experimental results show that See&Say outperforms all baselines, achieving the highest accuracy and IoU for safety map prediction as well as superior performance in alternative drop zone evaluation across multiple thresholds. These findings highlight the promise of VLM-guided segmentation-depth fusion for advancing safe and practical drone-based package delivery.
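
The IoU metric mentioned above can be sketched as follows. The abstract does not specify the evaluation protocol, so binarizing a soft safety score at several cutoffs is an assumption made here for illustration; the names `iou` and `iou_at_thresholds` and the default threshold values are hypothetical.

```python
import numpy as np

def iou(pred, target):
    """Intersection over Union between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Both masks are empty: treat as perfect agreement (a convention).
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)

def iou_at_thresholds(score_map, target, thresholds=(0.3, 0.5, 0.7)):
    """Binarize a soft safety score at each threshold and report IoU."""
    scores = np.asarray(score_map, dtype=float)
    return {t: iou(scores >= t, target) for t in thresholds}
```

Reporting the metric at several thresholds shows whether a method's advantage holds across operating points rather than at one tuned cutoff.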
Problem

Research questions and friction points this paper is trying to address.

safe zone detection
autonomous drone delivery
vision-language model
semantic segmentation
monocular depth
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Model
Safe Zone Detection
Depth-Semantic Fusion
Autonomous Delivery Drone
Open-Vocabulary Detection