ClimateVID -- Social Media Videos Analysis and Challenges Involved

📅 2026-04-30
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This study addresses the challenge of automatically identifying climate change-related visual themes in short social media videos without labeled data. The authors propose an analytical framework that combines zero-shot classification with unsupervised clustering: vision-language models (VideoChatGPT, PandaGPT, and VideoLLaVA) are used for zero-shot classification, while frame-level embeddings are extracted with CLIP, ConvNeXt V2, and DINOv2. Video frames are clustered by formulating the task as a minimum cost multicut problem, which the summary describes as a novel application. The work presents the first systematic evaluation of vision-language models for climate theme recognition and finds that their accuracy on specific climate categories is limited. In contrast, ConvNeXt V2 and DINOv2 yield semantically coherent clusters: ConvNeXt V2 captures fine-grained content differences, while DINOv2 captures stylistic and abstract features.
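To make the multicut formulation concrete, here is a minimal sketch in plain Python. It uses greedy additive edge contraction (GAEC), a common heuristic for the minimum cost multicut problem; the summary does not say which solver the authors use, so the choice of GAEC is an assumption. The edge costs are toy values: in practice they might be derived from the similarity of frame embeddings (for example, cosine similarity minus a threshold), which is also an assumption and not taken from the source.

```python
def gaec(num_nodes, edge_costs):
    """Greedy additive edge contraction for the minimum cost multicut problem.

    edge_costs maps (u, v) pairs to a cost: a positive cost rewards putting
    u and v in the same cluster, a negative cost rewards separating them.
    Returns a list with one cluster id per node. The number of clusters is
    not fixed in advance; it follows from the signs of the costs.
    """
    # Symmetric adjacency between current clusters, with summed costs.
    adj = {i: {} for i in range(num_nodes)}
    for (u, v), cost in edge_costs.items():
        if u == v:
            continue
        adj[u][v] = adj[u].get(v, 0.0) + cost
        adj[v][u] = adj[v].get(u, 0.0) + cost

    members = {i: [i] for i in range(num_nodes)}

    while True:
        # Find the most attractive (largest positive) edge.
        best = None
        for u in adj:
            for v, cost in adj[u].items():
                if u < v and cost > 0 and (best is None or cost > best[0]):
                    best = (cost, u, v)
        if best is None:
            break  # only repulsive edges remain: stop merging

        _, u, v = best
        # Contract the edge: merge cluster v into cluster u and add up
        # the costs of edges that become parallel.
        for w, cost in adj.pop(v).items():
            del adj[w][v]
            if w != u:
                adj[u][w] = adj[u].get(w, 0.0) + cost
                adj[w][u] = adj[u][w]
        members[u].extend(members.pop(v))

    labels = [0] * num_nodes
    for cluster_id, nodes in enumerate(members.values()):
        for node in nodes:
            labels[node] = cluster_id
    return labels


# Hypothetical toy graph over five frames (costs are made up).
toy_costs = {
    (0, 1): 2.0, (1, 2): 1.5, (0, 2): 1.0,   # frames 0-2 look alike
    (3, 4): 3.0,                             # frames 3-4 look alike
    (2, 3): -1.0, (1, 3): -0.5,              # the two groups repel
}
labels = gaec(5, toy_costs)
```

On this toy graph the heuristic merges frames 0, 1, 2 into one cluster and frames 3, 4 into another, without being told how many clusters to find. GAEC is a greedy approximation, so it does not guarantee an optimal multicut.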
📝 Abstract
The pervasive growth of digital content, specifically short videos on social media platforms, has significantly altered how topics are discussed and understood in public discourse. In this work, we advance automated visual theme detection by assessing zero-shot and clustering capabilities on social media data. (1) We evaluated the capabilities of notable VLMs such as VideoChatGPT, PandaGPT, and VideoLLaVA using zero-shot image classification and compared their performance to the baseline provided by frame-wise CLIP image classification. (2) By treating clustering as a minimum cost multicut problem, we aim to uncover insightful patterns in an unsupervised manner. For both analysis strategies, we provide extensive evaluations and practical guidance to practitioners. While VLMs are currently not able to detect climate change-specific classes, the clustering yields distinct visual frames. We find that both ConvNeXt V2 and DINOv2 produce meaningful clusters, with DINOv2 focusing more on style differences and abstract categories, while ConvNeXt V2 clusters differ in more fine-grained ways. Code is available at https://github.com/KathPra/ClimateVID.git.
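The frame-wise zero-shot baseline mentioned in the abstract can be illustrated with a small sketch. It assumes that frame embeddings and one text embedding per class label are already available (in a CLIP-style setup both would come from the same model); the three-dimensional vectors, the label names, and the majority vote over frames are all hypothetical choices made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify_frame(frame_emb, label_embs):
    """Return the label whose text embedding is most similar to the frame."""
    return max(label_embs, key=lambda label: cosine(frame_emb, label_embs[label]))


def classify_video(frame_embs, label_embs):
    """Classify every frame, then take a majority vote (an assumed aggregation)."""
    votes = Counter(classify_frame(f, label_embs) for f in frame_embs)
    return votes.most_common(1)[0][0]


# Hypothetical label embeddings and frame embeddings (toy 3-d vectors).
label_embs = {
    "wildfire": [1.0, 0.0, 0.0],
    "flood": [0.0, 1.0, 0.0],
    "protest": [0.0, 0.0, 1.0],
}
frames = [
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.1],
    [0.1, 0.9, 0.2],
]
prediction = classify_video(frames, label_embs)
```

Here two of the three frames are closest to the "wildfire" label, so the vote assigns that label to the whole video. No labeled training data is involved: the only supervision is the set of label texts, which is what makes the approach zero-shot.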
Problem

Research questions and friction points this paper is trying to address.

social media videos
visual theme detection
climate change
zero-shot classification
unsupervised clustering
Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot learning
visual clustering
vision-language models
minimum cost multicut
social media video analysis