🤖 AI Summary
This work addresses text-driven saliency detection in 360° video. To support systematic study, we introduce TSV360, a dataset of 16,000 triplets of equirectangular projection (ERP) frames, textual descriptions of the salient objects or events in those frames, and the associated ground-truth saliency maps. Building on a state-of-the-art vision-only approach to 360° video saliency detection, we develop TSalV360, which conditions its predictions on a user-provided text description of the desired objects and/or events. TSalV360 represents its inputs with a state-of-the-art vision-language model and combines a similarity estimation module with a viewport spatio-temporal cross-attention mechanism to capture dependencies between the text and video modalities. Quantitative and qualitative evaluations on TSV360 show that TSalV360 is competitive with a state-of-the-art vision-only approach and can perform customized, text-driven saliency detection in 360° video.
📝 Abstract
In this paper, we address the task of text-driven saliency detection in 360-degree videos. To this end, we introduce the TSV360 dataset, which includes 16,000 triplets of ERP frames, textual descriptions of salient objects/events in these frames, and the associated ground-truth saliency maps. We then extend and adapt a SOTA visual-based approach for 360-degree video saliency detection and develop the TSalV360 method, which takes into account a user-provided text description of the desired objects and/or events. This method leverages a SOTA vision-language model for data representation and integrates a similarity estimation module and a viewport spatio-temporal cross-attention mechanism to discover dependencies between the different data modalities. Quantitative and qualitative evaluations on the TSV360 dataset show that TSalV360 is competitive with a SOTA visual-based approach and document its ability to perform customized text-driven saliency detection in 360-degree videos.
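The abstract does not specify how the similarity estimation module works. As a rough illustration only, the sketch below shows one common way to relate a text description to viewports: cosine similarity between a text embedding and per-viewport embeddings, turned into weights with a softmax. The function names, the embedding shapes, and the use of cosine similarity are all assumptions made for illustration, not the TSalV360 implementation.

```python
import numpy as np


def text_viewport_similarity(text_emb, viewport_embs):
    """Cosine similarity between one text embedding (D,) and N viewport embeddings (N, D).

    Hypothetical helper: assumes embeddings come from some vision-language model.
    """
    t = text_emb / np.linalg.norm(text_emb)
    v = viewport_embs / np.linalg.norm(viewport_embs, axis=1, keepdims=True)
    return v @ t  # shape (N,)


def softmax(x, temperature=1.0):
    """Numerically stable softmax over a 1-D array."""
    z = (x - x.max()) / temperature
    e = np.exp(z)
    return e / e.sum()


# Toy example with 2-D embeddings: the first viewport points the same way as the text.
text = np.array([1.0, 0.0])
viewports = np.array([[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])

sims = text_viewport_similarity(text, viewports)  # [1.0, 0.0, -1.0]
weights = softmax(sims)  # largest weight goes to the first viewport
```

In a sketch like this, the weights could be used to emphasize the viewports that best match the description before any later fusion step. How TSalV360 actually combines similarity scores with its cross-attention mechanism is not stated in the abstract.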