🤖 AI Summary
Existing hyperspectral object tracking (HOT) datasets suffer from visual bias: targets have salient appearances, so trackers rely on visual cues from false-color images and neglect spectral information, which leads to severe performance degradation under camouflage, occlusion, or other conditions where appearance is unreliable. To address this, we propose a new task, hyperspectral camouflaged object tracking (HCOT), and introduce BihoT, the first dedicated benchmark, comprising 49 videos and 41,912 frames and featuring strong camouflage, high spectral diversity, and frequent occlusion. We further design the spectral prompt-based distractor-aware network (SPDAN), which extracts spectral-spatial features with 3-D and 2-D convolutions, fine-tunes an RGB tracker via spectral prompts, and uses a statistic-based distractor-aware module with a motion predictor to recover from occlusion. SPDAN achieves state-of-the-art performance on BihoT and on other HOT datasets.
📝 Abstract
Hyperspectral object tracking (HOT) has many important applications, particularly in scenes where objects are camouflaged. Existing trackers can effectively retrieve objects via band regrouping because of a bias in existing HOT datasets, where most objects have distinguishing visual appearances rather than distinguishing spectral characteristics. This bias allows a tracker to directly use the visual features obtained from the false-color images generated from hyperspectral images (HSIs) without extracting spectral features. To tackle this bias, the tracker should focus on spectral information when object appearance is unreliable. Thus, we introduce a new task called hyperspectral camouflaged object tracking (HCOT) and meticulously construct a large-scale HCOT dataset, BihoT, consisting of 41,912 HSIs covering 49 video sequences. The dataset covers various artificial camouflage scenes, where objects have similar appearances, diverse spectra, and frequent occlusion (OCC), making it a challenging dataset for HCOT. In addition, a simple but effective baseline model, named spectral prompt-based distractor-aware network (SPDAN), is proposed, comprising a spectral embedding network (SEN), a spectral prompt-based backbone network (SPBN), and a distractor-aware module (DAM). Specifically, the SEN extracts spectral-spatial features via 3-D and 2-D convolutions to form a refined prompt representation. Then, the SPBN fine-tunes powerful RGB trackers with spectral prompts, which alleviates the insufficiency of training samples. Moreover, the DAM uses a novel statistic to capture distractors caused by occlusion from other objects and the background, and corrects the resulting deterioration in tracking performance via a novel motion predictor. Extensive experiments demonstrate that the proposed SPDAN achieves state-of-the-art performance on the proposed BihoT and other HOT datasets.
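The band regrouping mentioned in the abstract can be illustrated with a minimal sketch: the bands of each HSI are split into groups of three, and each group is treated as a false-color image that an RGB tracker can consume. This is not the authors' code. The function name `regroup_bands`, the nested-list frame layout, the natural band order, and the padding rule for a short final group are all assumptions made for illustration.

```python
from typing import List, Sequence

Pixel = Sequence[float]            # one spectrum: B band values
Frame = Sequence[Sequence[Pixel]]  # H x W x B nested sequences


def regroup_bands(frame: Frame, band_order: Sequence[int]) -> List[List[List[List[float]]]]:
    """Split an H x W x B frame into three-channel false-color images.

    band_order lists band indices in the order they should be grouped;
    consecutive triples become the channels of one false-color image.
    A trailing group with fewer than three bands is padded by repeating
    its last band (an illustrative choice, not taken from the paper).
    """
    if not band_order:
        raise ValueError("band_order must not be empty")
    groups = [list(band_order[i:i + 3]) for i in range(0, len(band_order), 3)]
    while len(groups[-1]) < 3:
        groups[-1].append(groups[-1][-1])
    return [
        [[[pixel[b] for b in group] for pixel in row] for row in frame]
        for group in groups
    ]


# Toy 1 x 2 frame with 4 bands per pixel, grouped in natural band order.
frame = [[[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]]
images = regroup_bands(frame, band_order=[0, 1, 2, 3])
```

Each resulting image keeps only three bands, so a tracker that works on these images sees appearance cues but not the full spectrum of a pixel. That is the limitation the abstract points to when targets are camouflaged.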