🤖 AI Summary
This work addresses the challenges of limited onboard computational resources and decentralized coordination in enabling multiple unmanned aerial vehicles (UAVs) to perform efficient and robust open-vocabulary semantic goal navigation in unknown environments. To this end, the authors propose GoalSwarm, a novel framework that, for the first time, integrates the zero-shot foundation model SAM3 into a multi-UAV system. The approach constructs a lightweight, shared 2D semantic occupancy map and fuses multi-view confidence estimates into a Bayesian value map that quantifies goal relevance. A decentralized coordination mechanism is further introduced, combining semantic frontier extraction, utility-based bidding, and spatial separation penalties for effective task allocation. Experimental results demonstrate that GoalSwarm significantly enhances open-vocabulary goal navigation efficiency and collaborative exploration performance with low computational overhead while minimizing redundant coverage.
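As a rough illustration of the coordination mechanism summarized above, a utility-based bid with a spatial separation penalty could look like the following sketch. The function name, the weights `w_cost`/`w_sep`, the radius, and the linear penalty shape are illustrative assumptions, not the paper's implementation:

```python
import math

def frontier_bid(sem_value, path_cost, frontier_xy, peer_targets,
                 w_cost=0.5, w_sep=0.3, sep_radius=5.0):
    """Illustrative bid for one frontier: semantic utility minus travel
    cost, minus a penalty that grows as peers' claimed frontiers fall
    within sep_radius (all weights are assumed, not from the paper)."""
    penalty = sum(
        max(0.0, 1.0 - math.hypot(frontier_xy[0] - tx, frontier_xy[1] - ty) / sep_radius)
        for tx, ty in peer_targets
    )
    return sem_value - w_cost * path_cost - w_sep * penalty
```

In a decentralized scheme of this kind, each UAV would compute such bids locally for candidate frontiers and claim the one it wins, spreading the swarm without a central auctioneer.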
📝 Abstract
Cooperative visual semantic navigation is a foundational capability for aerial robot teams operating in unknown environments. However, achieving robust open-vocabulary object-goal navigation remains challenging due to the computational constraints of deploying heavy perception models onboard and the complexity of decentralized multi-agent coordination. We present GoalSwarm, a fully decentralized multi-UAV framework for zero-shot semantic object-goal navigation. The UAVs collaboratively construct a shared, lightweight 2D top-down semantic occupancy map by projecting depth observations from aerial vantage points, eliminating the computational burden of full 3D representations while preserving essential geometric and semantic structure. The core contributions of GoalSwarm are threefold: (1) integration of the zero-shot foundation model SAM3 for open-vocabulary detection and pixel-level segmentation, enabling target identification without task-specific training; (2) a Bayesian Value Map that fuses multi-viewpoint detection confidences into a per-pixel goal-relevance distribution, enabling informed frontier scoring via Upper Confidence Bound (UCB) exploration; and (3) a decentralized coordination strategy combining semantic frontier extraction, cost-utility bidding with geodesic path costs, and spatial separation penalties to minimize redundant exploration across the swarm.
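A minimal sketch of contribution (2): per-pixel Bayesian fusion of detection confidences can be written in log-odds form, with a UCB term that favors frontiers that are promising but still under-observed. The function names, the uniform 0.5 prior convention, and the constant `c` are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def fuse_confidence(prior, det_conf, eps=1e-6):
    """Per-pixel Bayesian update in log-odds form: fold a new view's
    detection-confidence map into the running goal-relevance posterior
    (a 0.5 prior is treated as 'no evidence'; eps guards the logs)."""
    prior = np.clip(prior, eps, 1 - eps)
    det_conf = np.clip(det_conf, eps, 1 - eps)
    post_logodds = np.log(prior / (1 - prior)) + np.log(det_conf / (1 - det_conf))
    return 1.0 / (1.0 + np.exp(-post_logodds))

def ucb_score(mean_relevance, n_obs, n_total, c=1.0):
    """UCB frontier score: fused goal relevance plus an exploration
    bonus for cells observed few times relative to the whole map."""
    return mean_relevance + c * np.sqrt(np.log(n_total + 1.0) / (n_obs + 1.0))
```

Under this sketch, repeated high-confidence detections from different viewpoints push the posterior toward 1, while the UCB bonus keeps rarely seen frontiers competitive during exploration.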