🤖 AI Summary
To address low localization accuracy in weakly supervised video object localization (WSVOL) for long videos and objects undergoing large movements, this paper proposes a collaborative class activation mapping (CAM) method that imposes no inter-frame positional constraints. The key innovation is incorporating object color consistency as a Conditional Random Field (CRF) loss term into CAM training, enabling direct cross-frame and cross-pixel response constraints and localization refinement, thereby significantly improving robustness to long-term temporal dependencies. The method jointly leverages CAMs, color-space constraints, and a weakly supervised co-localization framework, without requiring optical flow or explicit motion modeling. Evaluated on unconstrained video datasets such as YouTube-Objects, the approach achieves new state-of-the-art performance, particularly in scenarios involving large object displacements and extended temporal sequences.
📝 Abstract
Leveraging spatiotemporal information in videos is critical for weakly supervised video object localization (WSVOL) tasks. However, state-of-the-art methods rely only on visual and motion cues, while discarding discriminative information, making them susceptible to inaccurate localizations. Recently, discriminative models have been explored for WSVOL tasks using a temporal class activation mapping (CAM) method. Although their results are promising, objects are assumed to have limited movement from frame to frame, leading to degraded performance for relatively long-term dependencies. This paper proposes a novel CAM method for WSVOL that exploits spatiotemporal information in activation maps during training without constraining an object's position. Its training relies on co-localization, hence the name CoLo-CAM. Given a sequence of frames, localization is jointly learned based on color cues extracted across the corresponding maps, by assuming that an object has a similar color in consecutive frames. CAM activations are constrained to respond similarly over pixels with similar colors, achieving co-localization. This improves localization performance because the joint learning creates direct communication among pixels across all image locations and over all frames, allowing for transfer, aggregation, and correction of localizations. Co-localization is integrated into training by minimizing the color term of a conditional random field (CRF) loss over a sequence of frames/CAMs. Extensive experiments on two challenging YouTube-Objects datasets of unconstrained videos show the merits of our method and its robustness to long-term dependencies, leading to new state-of-the-art performance for the WSVOL task.
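The core idea above, penalizing CAM activations that differ at similar-colored pixels within and across frames, can be sketched as a pairwise loss. This is a minimal illustrative sketch, not the paper's implementation: the function name, the Gaussian color kernel, and the dense all-pairs formulation are assumptions (real dense-CRF losses use efficient filtering rather than an explicit N×N affinity matrix).

```python
import numpy as np

def color_crf_loss(cams, frames, sigma=0.1):
    """Hypothetical sketch of a CRF color-term loss over a frame sequence.

    Pixels with similar colors (in any frame of the sequence) are pushed
    to have similar CAM activations, which is the co-localization idea.

    cams:   (T, H, W) CAM activations in [0, 1]
    frames: (T, H, W, 3) RGB frames in [0, 1]
    """
    s = cams.reshape(-1)        # (N,) flattened activations, N = T*H*W
    c = frames.reshape(-1, 3)   # (N, 3) flattened pixel colors
    # Gaussian color affinity between every pair of pixels, across frames
    diff = c[:, None, :] - c[None, :, :]              # (N, N, 3)
    w = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * sigma ** 2))
    # Penalize activation disagreement weighted by color similarity
    pair = (s[:, None] - s[None, :]) ** 2             # (N, N)
    return float(np.sum(w * pair) / (s.size ** 2))
```

With identical activations everywhere the loss is zero; an activation that disagrees with same-colored neighbors increases it, which is how minimizing this term propagates and corrects localizations across the sequence. The O(N²) affinity matrix here is only workable for tiny inputs.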