CosFly-Track: A Large-Scale Multi-Modal Dataset for UAV Visual Tracking via Multi-Constraint Trajectory Optimization

📅 2026-05-17
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the lack of datasets for drone-based continuous tracking of dynamic targets while maintaining visibility; existing aerial vision-language navigation benchmarks predominantly focus on static goals. To bridge this gap, the authors introduce CosFly-Track, a large-scale multimodal drone tracking dataset for dynamic target following, with aligned streams including RGB, metric depth, semantic segmentation, six-degree-of-freedom drone pose, target state with a visibility flag, and bilingual (Chinese-English) instructions. Its core innovation is MuCO, a multi-constraint optimizer that jointly handles target visibility, viewpoint quality, obstacle avoidance, trajectory smoothness, and kinematic feasibility directly in continuous 3D space, which avoids the discretization artifacts of grid-based planners, and uses BVH-accelerated collision and visibility queries for efficient trajectory generation. Fine-tuning seven vision-language models on the dataset yields tracking success rates (SR@1m) of 78.3% to 95.6%, a gain of 53 to 69 percentage points over zero-shot baselines.
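To make the idea of scoring one trajectory against several constraints at once more concrete, here is a minimal Python sketch. It is not the MuCO implementation, and every name, weight, and modelling choice in it is an assumption made for illustration: obstacles are plain spheres, the terms are combined as a weighted sum, the BVH is replaced by a brute-force loop over obstacles, and kinematic feasibility is left out.

```python
import math

def segment_hits_sphere(p, q, center, radius):
    """Return True if the segment p -> q passes within `radius` of `center`."""
    d = [q[i] - p[i] for i in range(3)]
    f = [center[i] - p[i] for i in range(3)]
    dd = sum(x * x for x in d)
    # Parameter of the closest point on the segment, clamped to [0, 1].
    t = 0.0 if dd == 0 else max(0.0, min(1.0, sum(f[i] * d[i] for i in range(3)) / dd))
    closest = [p[i] + t * d[i] for i in range(3)]
    return math.dist(closest, center) < radius

def trajectory_cost(drone, target, obstacles, desired_range=5.0,
                    w_vis=10.0, w_view=1.0, w_col=10.0, w_smooth=1.0):
    """Hypothetical weighted-sum cost over per-timestep drone and target positions.

    `obstacles` is a list of (center, radius) spheres. All weights and the
    desired viewing range are illustrative defaults, not values from the paper.
    """
    cost = 0.0
    for k, (p, g) in enumerate(zip(drone, target)):
        # Visibility: penalise timesteps where the line of sight is blocked.
        if any(segment_hits_sphere(p, g, c, r) for c, r in obstacles):
            cost += w_vis
        # Viewpoint quality: stay near a preferred distance from the target.
        cost += w_view * (math.dist(p, g) - desired_range) ** 2
        # Collision avoidance: penalise penetration into any obstacle.
        for c, r in obstacles:
            cost += w_col * max(0.0, r - math.dist(p, c)) ** 2
        # Smoothness: squared second finite difference (discrete acceleration).
        if 0 < k < len(drone) - 1:
            acc = [drone[k - 1][i] - 2 * p[i] + drone[k + 1][i] for i in range(3)]
            cost += w_smooth * sum(a * a for a in acc)
    return cost

target = [(float(t), 0.0, 0.0) for t in range(4)]
drone = [(float(t), 0.0, 5.0) for t in range(4)]   # hovers 5 m above the target
cost_clear = trajectory_cost(drone, target, obstacles=[])
cost_blocked = trajectory_cost(drone, target, obstacles=[((1.0, 0.0, 2.5), 0.5)])
print(cost_clear, cost_blocked)
```

In this toy setup the obstacle sits on the line of sight at a single timestep, so only the visibility term separates the two costs. Because positions are real-valued, such a cost can be handed to a continuous optimizer without first snapping waypoints to a grid.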
📝 Abstract
Recent aerial vision-language navigation (VLN) datasets have grown rapidly, but they primarily address goal-oriented navigation to static destinations, leaving UAV visual tracking (continuously following a moving target while maintaining visibility) largely without dedicated training data. We introduce CosFly-Track, a large-scale multi-modal dataset and scalable generation pipeline for UAV visual tracking in urban environments. The dataset provides approximately 12,000 expert and perturbed UAV trajectories generated from 6,000 pedestrian paths, comprising 2.4 million timesteps (approximately 334 hours) with seven aligned data channels: RGB, metric depth, semantic segmentation, six-degree-of-freedom drone pose, target state with visibility flag, bilingual (Chinese-English) instructions, and trajectory-pair metadata. To generate high-quality expert trajectories, we develop MuCO, a multi-constraint optimizer that plans directly in continuous three-dimensional space with BVH-accelerated collision and visibility queries, jointly enforcing target visibility, viewpoint quality, collision avoidance, smoothness, and kinematic feasibility, avoiding the discretization artifacts and post-hoc smoothing of grid-based planners. Fine-tuning experiments on seven vision-language models show that CosFly-Track improves tracking performance to 78.3% to 95.6% SR@1m, a gain of 53 to 69 percentage points over zero-shot baselines, supporting the dataset as a training resource for dynamic target-following agents. The dataset is publicly available at https://huggingface.co/datasets/AutelRobotics/CosFly; evaluation scripts and pre-trained checkpoints are hosted at https://huggingface.co/AutelRobotics/CosFly-Track.
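The scale figures in the abstract can be cross-checked with a little arithmetic. The Python sketch below uses only the approximate numbers quoted above; the sampling rate and per-trajectory duration it derives are inferences that assume uniform sampling across all trajectories, not values stated in the abstract.

```python
# Approximate figures quoted in the abstract.
trajectories = 12_000
pedestrian_paths = 6_000
timesteps = 2_400_000
hours = 334

# Two UAV trajectories per pedestrian path, in line with the
# "expert and perturbed" pairing and the trajectory-pair metadata channel.
per_path = trajectories / pedestrian_paths

# Average trajectory length in timesteps.
steps_per_traj = timesteps / trajectories

# Implied sampling rate, assuming uniform sampling (an inference).
rate_hz = timesteps / (hours * 3600)

# Implied average trajectory duration in seconds (also an inference).
traj_seconds = steps_per_traj / rate_hz

print(per_path, steps_per_traj, round(rate_hz, 2), round(traj_seconds))
```

Under these assumptions the numbers are mutually consistent: about 200 timesteps per trajectory, sampled at roughly 2 Hz, for around 100 seconds of flight each.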
Problem

Research questions and friction points this paper is trying to address.

UAV visual tracking
moving target following
vision-language navigation
training data scarcity
dynamic target tracking
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-constraint trajectory optimization
UAV visual tracking
continuous 3D planning
BVH-accelerated collision detection
vision-language navigation