🤖 AI Summary
Existing video understanding research focuses mainly on instance-level recognition, which is not sufficient for modeling the high-level semantics of user-generated short-form videos. To address this gap, the authors introduce USV, a large-scale dataset of about 224,000 user-generated short-form videos collected from UGC platforms by label queries, without extra manual verification or trimming. They define two high-level semantic understanding tasks on USV: topic recognition and video-text retrieval. They also propose two baseline methods, Multi-Modality Fusion Network (MMF-Net) for topic recognition and Video-Text Contrastive Learning (VTCL) for video-text retrieval, and report comprehensive benchmarks intended to support future research on this kind of video.
📝 Abstract
Several large-scale video datasets have been published in recent years and have advanced the field of video understanding. However, newly emerged user-generated short-form videos have rarely been studied. This paper presents USV, the User-generated Short-form Video dataset for high-level semantic video understanding. The dataset contains around 224K videos collected from UGC platforms by label queries, without extra manual verification or trimming. Although video understanding has advanced in recent years, most works focus on instance-level recognition, which is not sufficient for learning representations of the high-level semantic information in videos. Therefore, we further establish two tasks on USV: topic recognition and video-text retrieval. We propose two unified and effective baseline methods, Multi-Modality Fusion Network (MMF-Net) and Video-Text Contrastive Learning (VTCL), to tackle topic recognition and video-text retrieval respectively, and carry out comprehensive benchmarks to facilitate future research. Our project page is https://usvdataset.github.io.
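
The abstract names VTCL but does not state its training objective. As a rough illustration of what video-text contrastive learning commonly looks like, the sketch below computes a symmetric InfoNCE-style loss over a batch of paired video and text embeddings. This is an assumption for illustration only, not the authors' formulation: the function names, the temperature value of 0.07, and the use of cosine similarity are all hypothetical choices, and the embeddings are assumed to be nonzero vectors.

```python
import math


def l2_normalize(vec):
    """Scale a nonzero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy over rows, where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Hypothetical symmetric InfoNCE loss; pair i is (video_emb[i], text_emb[i]).

    The temperature is an illustrative default, not a value from the paper.
    """
    videos = [l2_normalize(v) for v in video_emb]
    texts = [l2_normalize(t) for t in text_emb]
    # logits[i][j] is the scaled cosine similarity between video i and text j.
    logits = [
        [sum(a * b for a, b in zip(v, t)) / temperature for t in texts]
        for v in videos
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-video direction
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


# Toy batch: two videos and two captions in a 2-D embedding space.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
mismatched = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                        [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is lower when each video is paired with the matching caption than when the captions are swapped, which is the behavior a retrieval model trained this way would rely on at test time.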