ClimateVID -- Social Media Videos Analysis and Challenges Involved

📅 2026-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the challenge of automatically identifying climate change–related visual themes in social media short videos without labeled data. The authors propose an analytical framework that integrates zero-shot classification with unsupervised clustering: vision-language models such as VideoChatGPT, PandaGPT, and VideoLLaVA are employed for zero-shot classification, while frame-level embeddings are extracted using CLIP, ConvNeXt V2, and DINOv2. A novel application of the minimum-cost multicut algorithm is introduced to cluster video frames. The work presents the first systematic evaluation of vision-language models for climate theme recognition, revealing their limited accuracy in identifying specific climate categories. In contrast, ConvNeXt V2 and DINOv2 yield semantically coherent clusters, capturing fine-grained content and stylized abstract features, respectively.
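The frame-wise zero-shot baseline mentioned above can be illustrated with a minimal sketch. It assumes that image and text embeddings have already been computed (for example with CLIP) and replaces them with synthetic vectors; the class names, the function `zero_shot_video_label`, and the majority-vote aggregation over frames are hypothetical choices for illustration, not details confirmed by the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_video_label(frame_emb, class_emb, class_names):
    """Label a video by majority vote over per-frame nearest class prompts.

    frame_emb: (n_frames, d) image embeddings, one row per sampled frame.
    class_emb: (n_classes, d) text embeddings of the class prompts.
    """
    sims = l2_normalize(frame_emb) @ l2_normalize(class_emb).T  # cosine similarities
    frame_pred = sims.argmax(axis=1)                            # best class per frame
    votes = np.bincount(frame_pred, minlength=len(class_names))
    return class_names[int(votes.argmax())], frame_pred

# Synthetic stand-ins for real embeddings (assumption: no model is loaded here).
rng = np.random.default_rng(0)
class_names = ["protest", "wildfire", "infographic"]  # hypothetical classes
class_emb = np.eye(3, 16)                              # orthogonal toy prompts
frame_emb = class_emb[1] + 0.1 * rng.normal(size=(5, 16))

label, frame_pred = zero_shot_video_label(frame_emb, class_emb, class_names)
print(label, frame_pred)
```

Majority voting is only one possible way to aggregate frame predictions; averaging the similarity scores before taking the argmax would be an equally plausible alternative.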

📝 Abstract

The pervasive growth of digital content, specifically short videos on social media platforms, has significantly altered how topics are discussed and understood in public discourse. In this work, we advance automated visual theme detection by assessing zero-shot and clustering capabilities on social media data. (1) We evaluate the capabilities of notable VLMs such as VideoChatGPT, PandaGPT, and VideoLLaVA using zero-shot image classification and compare their performance to the baseline provided by frame-wise CLIP image classification. (2) By treating clustering as a minimum cost multicut problem, we aim to uncover insightful patterns in an unsupervised manner. For both analysis strategies, we provide extensive evaluations and practical guidance to practitioners. While VLMs are currently not able to detect climate-change-specific classes, the clustering results form distinct visual frames. We find that both ConvNeXt V2 and DINOv2 produce meaningful clusters, with DINOv2 focusing more on style differences and abstract categories, while ConvNeXt V2 clusters differ in more fine-grained ways. Code available at https://github.com/KathPra/ClimateVID.git.
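To make the multicut formulation in (2) more concrete, here is a minimal sketch of clustering frames by greedy additive edge contraction, a common heuristic for the minimum cost multicut problem. The solver actually used in the paper is not stated here, so the heuristic, the function name `greedy_multicut`, the similarity threshold of 0.5, and the toy 2-D embeddings are all assumptions made for illustration.

```python
import numpy as np

def greedy_multicut(sim, threshold=0.5):
    """Greedy additive edge contraction on a complete similarity graph.

    Edge cost = similarity - threshold: positive costs attract, negative
    costs repel. The pair of clusters with the largest positive summed cost
    is merged repeatedly until no attractive pair is left. The number of
    clusters is therefore not fixed in advance.
    """
    n = sim.shape[0]
    cost = sim.astype(float) - threshold
    np.fill_diagonal(cost, -np.inf)            # never merge a node with itself
    clusters = {i: [i] for i in range(n)}
    active = list(range(n))
    while len(active) > 1:
        sub = cost[np.ix_(active, active)]
        a, b = np.unravel_index(np.argmax(sub), sub.shape)
        if sub[a, b] <= 0:                     # only repulsive edges remain
            break
        i, j = active[a], active[b]
        cost[i, :] += cost[j, :]               # merged cluster inherits summed costs
        cost[:, i] += cost[:, j]
        cost[i, i] = -np.inf
        clusters[i].extend(clusters.pop(j))
        active.remove(j)
    labels = np.empty(n, dtype=int)
    for k, members in enumerate(clusters.values()):
        labels[members] = k
    return labels

# Toy frame embeddings: two visually distinct groups (hypothetical data).
emb = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.1],
                [0.0, 1.0], [0.1, 0.9], [0.1, 1.0]])
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
labels = greedy_multicut(emb @ emb.T)
print(labels)
```

In practice the similarity matrix would come from frame embeddings such as those of ConvNeXt V2 or DINOv2, and the threshold controls how readily frames are grouped together.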

Problem

Research questions and friction points this paper is trying to address.

social media videos

visual theme detection

climate change

zero-shot classification

unsupervised clustering

Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot learning

visual clustering

vision-language models