TSalV360: A Method and Dataset for Text-driven Saliency Detection in 360-Degrees Videos

📅 2025-09-30
🤖 AI Summary
This work addresses text-driven saliency detection in 360° video. It introduces TSV360, a dataset of 16,000 triplets, each consisting of an equirectangular projection (ERP) frame, a textual description of the salient objects/events in that frame, and the associated ground-truth saliency map. On the method side, TSalV360 extends and adapts a state-of-the-art visual-based approach for 360° video saliency detection so that it takes into account a user-provided text description of the desired objects and/or events. It uses a state-of-the-art vision-language model for data representation and integrates a similarity estimation module and a viewport spatio-temporal cross-attention mechanism to discover dependencies between the data modalities. Quantitative and qualitative evaluations on TSV360 show that TSalV360 is competitive with a state-of-the-art visual-based approach and can perform customized text-driven saliency detection in 360° videos.

📝 Abstract
In this paper, we deal with the task of text-driven saliency detection in 360-degrees videos. For this, we introduce the TSV360 dataset, which includes 16,000 triplets of ERP frames, textual descriptions of salient objects/events in these frames, and the associated ground-truth saliency maps. Next, we extend and adapt a SOTA visual-based approach for 360-degrees video saliency detection, and develop the TSalV360 method, which takes into account a user-provided text description of the desired objects and/or events. This method leverages a SOTA vision-language model for data representation and integrates a similarity estimation module and a viewport spatio-temporal cross-attention mechanism to discover dependencies between the different data modalities. Quantitative and qualitative evaluations using the TSV360 dataset showed the competitiveness of TSalV360 compared to a SOTA visual-based approach and documented its competency to perform customized text-driven saliency detection in 360-degrees videos.
Problem

Research questions and friction points this paper is trying to address.

Detecting salient objects and events in 360-degree videos from user-provided text descriptions
Creating a dataset with frames, text, and saliency maps for training
Developing a method that integrates vision-language models for cross-modal analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages vision-language model for data representation
Integrates similarity estimation module for cross-modal analysis
Uses viewport spatio-temporal cross-attention mechanism
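
The two building blocks named above, similarity estimation and cross-attention, can be illustrated with a minimal, generic sketch. This is not the TSalV360 architecture: the embedding sizes, the random stand-in features, and the names `cosine_similarity`, `cross_attention`, `viewport_emb`, and `text_tokens` are all assumptions made for illustration. In a real pipeline the embeddings would come from a vision-language model rather than a random generator.

```python
import numpy as np

def cosine_similarity(viewports, text):
    """Cosine similarity between each viewport embedding and one text embedding."""
    v = viewports / np.linalg.norm(viewports, axis=-1, keepdims=True)
    t = text / np.linalg.norm(text)
    return v @ t  # shape: (num_viewports,)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys/values."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (num_queries, num_keys)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ values, weights

# Hypothetical sizes and random stand-ins for real embeddings (assumption).
rng = np.random.default_rng(0)
num_viewports, num_tokens, dim = 6, 4, 8
viewport_emb = rng.normal(size=(num_viewports, dim))  # one vector per viewport
text_tokens = rng.normal(size=(num_tokens, dim))      # one vector per text token
text_emb = text_tokens.mean(axis=0)                   # pooled text embedding

# How well each viewport matches the text description.
sim = cosine_similarity(viewport_emb, text_emb)

# Viewport features enriched with text information via cross-attention.
fused, attn = cross_attention(viewport_emb, text_tokens, text_tokens)
```

Here `sim` gives one relevance score per viewport, and `fused` holds viewport features that have attended over the text tokens. How the actual method combines such signals across space and time is not specified in this summary.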