🤖 AI Summary
This study addresses the severe performance degradation that existing optical-image-based drone traffic understanding methods suffer under adverse illumination conditions such as nighttime and fog, as well as their limited capacity to model complex traffic behaviors. To overcome these challenges, the work proposes CTCNet, a cross-spectral traffic cognition network that fuses optical and thermal infrared modalities and incorporates an external traffic rule knowledge base. The approach introduces two key innovations: a prototype-guided knowledge embedding (PGKE) mechanism for structured injection of domain-specific rules, and a quality-aware spectral compensation (QASC) module that lets the two modalities contextually complement each other. In addition, the authors construct Traffic-VQA, the first large-scale dual-spectrum drone traffic visual question answering benchmark, comprising 8,180 aligned optical-thermal image pairs and 1.3 million question-answer pairs. Experiments demonstrate that CTCNet substantially outperforms existing methods on both perception and cognition tasks.
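As a rough illustration of how such structured knowledge injection could look, the PyTorch sketch below treats the Traffic Regulation Memory as a bank of learnable rule prototypes that visual tokens query via cross-attention. This is an assumption for illustration only, not the authors' implementation: the module name, memory size, and residual fusion are all hypothetical choices.

```python
import torch
import torch.nn as nn

class PrototypeKnowledgeEmbedding(nn.Module):
    """Illustrative PGKE-style module: visual tokens retrieve rule knowledge
    from a bank of learnable prototypes (a stand-in for the Traffic Regulation Memory)."""

    def __init__(self, dim=768, num_prototypes=64, num_heads=8):
        super().__init__()
        # Traffic Regulation Memory stand-in: learnable rule prototypes (assumed)
        self.trm = nn.Parameter(torch.randn(num_prototypes, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens):                         # (B, N, dim)
        B = visual_tokens.size(0)
        prototypes = self.trm.unsqueeze(0).expand(B, -1, -1)  # (B, K, dim)
        # Visual tokens query the rule prototypes; retrieved knowledge is fused residually
        knowledge, _ = self.cross_attn(visual_tokens, prototypes, prototypes)
        return self.norm(visual_tokens + knowledge)


# Example usage with random visual features
tokens = torch.randn(2, 196, 768)
out = PrototypeKnowledgeEmbedding()(tokens)                   # (2, 196, 768)
```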
📝 Abstract
Traffic scene understanding from unmanned aerial vehicle (UAV) platforms is crucial for intelligent transportation systems owing to their flexible deployment and wide-area monitoring capabilities. However, existing methods face significant challenges in real-world surveillance, as their heavy reliance on optical imagery leads to severe performance degradation under adverse illumination conditions such as nighttime and fog. Furthermore, current Visual Question Answering (VQA) models are restricted to elementary perception tasks and lack the domain-specific regulatory knowledge required to assess complex traffic behaviors. To address these limitations, we propose a novel Cross-spectral Traffic Cognition Network (CTCNet) for robust UAV traffic scene understanding. Specifically, we design a Prototype-Guided Knowledge Embedding (PGKE) module that leverages high-level semantic prototypes from an external Traffic Regulation Memory (TRM) to anchor domain-specific knowledge in visual representations, enabling the model to comprehend complex behaviors and distinguish fine-grained traffic violations. Moreover, we develop a Quality-Aware Spectral Compensation (QASC) module that exploits the complementary characteristics of the optical and thermal modalities to perform bidirectional context exchange, compensating for degraded features and ensuring robust representations in complex environments. In addition, we construct Traffic-VQA, the first large-scale optical-thermal infrared benchmark for cognitive UAV traffic understanding, comprising 8,180 aligned image pairs and 1.3 million question-answer pairs across 31 diverse question types. Extensive experiments demonstrate that CTCNet significantly outperforms state-of-the-art methods in both cognition and perception scenarios. The dataset is available at https://github.com/YuZhang-2004/UAV-traffic-scene-understanding.
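To make the QASC idea concrete, the following is a minimal PyTorch sketch of quality-aware bidirectional compensation between optical and thermal features. The scalar quality gates, the linear projections, and the gating formula are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn

class QualityAwareCompensation(nn.Module):
    """Illustrative QASC-style module: each spectral stream is compensated by the
    other, weighted by a learned estimate of its own degradation."""

    def __init__(self, dim=768):
        super().__init__()
        # Scalar quality gate per modality, predicted from globally pooled features (assumed)
        self.quality_rgb = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.quality_tir = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.tir_to_rgb = nn.Linear(dim, dim)
        self.rgb_to_tir = nn.Linear(dim, dim)

    def forward(self, f_rgb, f_tir):                               # each (B, N, dim)
        q_rgb = self.quality_rgb(f_rgb.mean(dim=1)).unsqueeze(-1)  # (B, 1, 1)
        q_tir = self.quality_tir(f_tir.mean(dim=1)).unsqueeze(-1)
        # Bidirectional context exchange: the more degraded a stream is
        # (low quality score), the more context it borrows from the other modality.
        f_rgb_out = f_rgb + (1.0 - q_rgb) * self.tir_to_rgb(f_tir)
        f_tir_out = f_tir + (1.0 - q_tir) * self.rgb_to_tir(f_rgb)
        return f_rgb_out, f_tir_out


# Example usage: paired optical and thermal feature maps from the same scene
f_rgb = torch.randn(2, 196, 768)
f_tir = torch.randn(2, 196, 768)
out_rgb, out_tir = QualityAwareCompensation()(f_rgb, f_tir)
```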