TSalV360: A Method and Dataset for Text-driven Saliency Detection in 360-Degrees Videos

📅 2025-09-30

🤖 AI Summary

This work addresses text-driven saliency detection in 360° video. We introduce TSV360, a dataset of 16,000 triplets, each consisting of an equirectangular projection (ERP) frame, a textual description of the salient objects or events in that frame, and the associated ground-truth saliency map. We also propose TSalV360, a method that extends and adapts a state-of-the-art visual-based approach to 360° video saliency detection so that it takes a user-provided text description into account. TSalV360 uses a state-of-the-art vision-language model for data representation and integrates a similarity estimation module and a viewport spatio-temporal cross-attention mechanism to discover dependencies between the two modalities. Quantitative and qualitative evaluations on TSV360 show that TSalV360 is competitive with a state-of-the-art visual-based approach and can perform customized, text-driven saliency detection in 360° videos.

📝 Abstract

In this paper, we deal with the task of text-driven saliency detection in 360-degree videos. For this, we introduce the TSV360 dataset, which includes 16,000 triplets of ERP frames, textual descriptions of salient objects/events in these frames, and the associated ground-truth saliency maps. We then extend and adapt a SOTA visual-based approach for 360-degree video saliency detection, and develop the TSalV360 method, which takes into account a user-provided text description of the desired objects and/or events. This method leverages a SOTA vision-language model for data representation and integrates a similarity estimation module and a viewport spatio-temporal cross-attention mechanism to discover dependencies between the different data modalities. Quantitative and qualitative evaluations using the TSV360 dataset showed the competitiveness of TSalV360 compared to a SOTA visual-based approach and documented its competency to perform customized text-driven saliency detection in 360-degree videos.
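
The abstract does not specify how the similarity estimation module is built. As a rough illustration of the general idea only, the sketch below scores each viewport of a frame against a text description by cosine similarity between embeddings and turns the scores into normalized relevance weights. The function names, the `temperature` parameter, and the toy embeddings are all assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def viewport_relevance(text_embedding, viewport_embeddings, temperature=0.1):
    """Hypothetical helper: softmax over text-to-viewport cosine similarities.

    Returns one weight per viewport; the weights sum to 1.
    """
    sims = [cosine_similarity(text_embedding, v) for v in viewport_embeddings]
    scaled = [s / temperature for s in sims]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: three viewports with 2-D embeddings (illustrative values only).
weights = viewport_relevance([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
```

In this toy example, the viewport whose embedding points in the same direction as the text embedding receives the largest weight, and the one pointing the opposite way receives the smallest.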

Problem

Research questions and friction points this paper is trying to address.

Detecting salient objects in 360-degree videos using text descriptions

Creating a dataset with frames, text, and saliency maps for training

Developing a method that integrates vision-language models for cross-modal analysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages a SOTA vision-language model for data representation

Integrates a similarity estimation module for cross-modal analysis

Uses a viewport spatio-temporal cross-attention mechanism
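
The summary names a viewport spatio-temporal cross-attention mechanism without describing its internals. The sketch below shows only generic scaled dot-product cross-attention, under the assumption that one modality supplies the queries (for example, viewport features across frames) and the other supplies the keys and values (for example, text token features). It is not the paper's architecture; `cross_attention` and the toy inputs are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention.

    queries: list of d-dimensional vectors (assumed here: viewport features)
    keys:    list of d-dimensional vectors (assumed here: text token features)
    values:  list of vectors, one per key
    Returns one output vector per query: a weighted average of the values.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Toy inputs: two queries attending over two key/value pairs.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
attended = cross_attention([[10.0, 0.0], [0.0, 0.0]], keys, values)
```

A query that aligns strongly with the first key returns an output close to the first value, while an all-zero query attends uniformly and returns the average of the values.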
