🤖 AI Summary
This work addresses two challenges in complex dynamic environments: keeping a drone stably on a moving target, and coordinating multimodal perception. We propose a modular seven-step pipeline that converts 3D scenes into structured obstacle representations for trajectory planning, then renders multimodal data comprising RGB images, depth maps, semantic segmentation masks, and natural language instructions. Built on the CARLA platform, the framework provides a multimodal simulation environment with configurable fixed-FOV zoom (one field-of-view setting per trajectory, held constant throughout), analyzes two trajectory-planning paradigms, and releases CosFly-Track, a large-scale drone tracking dataset with 250 validated trajectories and approximately 100,000 images annotated with full 6-DOF poses. The pipeline and dataset provide a scalable foundation for research in drone navigation and aerial-ground collaboration.
📝 Abstract
We present CosFly, a box-structured planning and multimodal simulation pipeline for aerial tracking, together with CosFly-Track, a large-scale UAV dataset for dynamic target tracking across diverse environments including urban centers, highways, rural landscapes, forests, and coastal towns. In our current implementation on CARLA, CosFly provides a modular seven-step construction pipeline that converts complex 3D worlds into structured obstacle representations for planning, then projects the resulting trajectories back into multimodal sensor data (RGB images, high-precision depth maps, and semantic segmentation masks) paired with natural language navigation instructions. A key feature is support for configurable fixed-FOV zoom levels: one FOV setting is drawn per trajectory and held constant throughout, which simulates different focal lengths through camera-intrinsic adjustments. The pipeline covers the complete workflow: 3D map export, grid simplification, pedestrian and drone trajectory planning, multimodal rendering with 6-DOF pose annotations, quality inspection, and teacher-student caption generation. We analyze two trajectory-planning paradigms for aerial target tracking: a conventional two-stage pipeline with front-end candidate generation and back-end refinement, and a direct gradient-based formulation that optimizes multiple tracking constraints in a single objective. The public CosFly-Track release contains 250 validated trajectories and approximately 100,000 rendered images with complete 6-DOF drone pose annotations (position x, y, z and orientation yaw, pitch, roll). Together, the pipeline and dataset establish a scalable foundation for aerial-ground collaborative research, supporting dynamic target tracking, UAV navigation, and multimodal perception.
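The fixed-FOV zoom described in the abstract rests on the standard pinhole relation between field of view and focal length, f = W / (2 tan(FOV / 2)). The sketch below illustrates that relation only; it is not the authors' code. It assumes square pixels, a principal point at the image center, and a horizontal FOV in degrees (the convention CARLA cameras use). The function names, the image size, and the candidate FOV list are hypothetical.

```python
import math
import random


def focal_length_px(image_width_px: int, fov_deg: float) -> float:
    """Focal length in pixels for a pinhole camera with the given horizontal FOV."""
    return image_width_px / (2.0 * math.tan(math.radians(fov_deg) / 2.0))


def intrinsic_matrix(width: int, height: int, fov_deg: float) -> list[list[float]]:
    """3x3 intrinsic matrix K, assuming square pixels and a centered principal point."""
    f = focal_length_px(width, fov_deg)
    return [
        [f, 0.0, width / 2.0],
        [0.0, f, height / 2.0],
        [0.0, 0.0, 1.0],
    ]


def sample_trajectory_fov(candidates: list[float], rng: random.Random) -> float:
    """Draw one FOV for a whole trajectory; the caller reuses it for every frame."""
    return rng.choice(candidates)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical zoom levels; the abstract does not list the actual FOV values.
    fov = sample_trajectory_fov([30.0, 60.0, 90.0], rng)
    K = intrinsic_matrix(800, 600, fov)  # hypothetical image size
    print(f"FOV {fov:.0f} deg -> focal length {K[0][0]:.1f} px")
```

A narrower FOV gives a longer focal length, so the same target fills more of the frame. Holding one FOV per trajectory keeps K constant across that trajectory's frames.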