OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

📅 2024-07-02
🏛️ arXiv.org
📈 Citations: 30
✨ Influential: 4
📄 PDF
🤖 AI Summary
Text-to-video (T2V) generation is hindered by two key bottlenecks: scarcity of high-quality open-source training data and insufficient exploitation of fine-grained textual semantics. To address these, we introduce OpenVid-1M, the first large-scale (1M video-text pairs), high-fidelity, and fully open benchmark dataset, along with its high-definition subset, OpenVidHD-0.4M. We further propose the Multi-modal Video Diffusion Transformer (MVDiT), which overcomes limitations of conventional cross-attention mechanisms via joint visual-textual token modeling, structured spatiotemporal attention, and high-fidelity resampling. This enables synergistic learning of visual structure and deep textual semantics. Experiments demonstrate that OpenVid-1M significantly improves training stability and generation consistency. MVDiT achieves state-of-the-art performance across multiple T2V benchmarks and, for the first time, enables practical, high-quality 1080p video generation.

๐Ÿ“ Abstract
Text-to-video (T2V) generation has recently garnered significant attention, thanks in large part to the large multi-modality model Sora. However, T2V generation still faces two important challenges: 1) the lack of a precise, open-source, high-quality dataset. Previous popular video datasets, e.g., WebVid-10M and Panda-70M, are either of low quality or too large for most research institutions, so collecting precise, high-quality text-video pairs is challenging but crucial for T2V generation. 2) the failure to fully utilize textual information. Recent T2V methods have focused on vision transformers, using a simple cross-attention module for video generation, which falls short of thoroughly extracting semantic information from the text prompt. To address these issues, we introduce OpenVid-1M, a precise high-quality dataset with expressive captions. This open-scenario dataset contains over 1 million text-video pairs, facilitating research on T2V generation. Furthermore, we curate 433K 1080p videos from OpenVid-1M to create OpenVidHD-0.4M, advancing high-definition video generation. Additionally, we propose a novel Multi-modal Video Diffusion Transformer (MVDiT) capable of mining both structure information from visual tokens and semantic information from text tokens. Extensive experiments and ablation studies verify the superiority of OpenVid-1M over previous datasets and the effectiveness of our MVDiT.
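
To make the architectural distinction concrete, here is a minimal, hypothetical PyTorch sketch (not the authors' released code): it contrasts a plain cross-attention block, where visual tokens only query text tokens, with a joint multi-modal attention block that attends over the concatenation of visual and text tokens, the direction the abstract attributes to MVDiT. The class names, dimensions, and token counts are illustrative assumptions.

# Hypothetical sketch, not the paper's implementation; names and sizes are assumptions.
import torch
import torch.nn as nn


class CrossAttentionBlock(nn.Module):
    """Baseline conditioning: visual tokens attend to text tokens only."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vis_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(query=vis_tokens, key=txt_tokens, value=txt_tokens)
        return vis_tokens + out  # only the visual stream is updated


class JointMultiModalAttentionBlock(nn.Module):
    """Joint modeling: self-attention over the concatenation of visual and text
    tokens, so textual semantics and visual structure update each other."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vis_tokens: torch.Tensor, txt_tokens: torch.Tensor):
        n_vis = vis_tokens.shape[1]
        tokens = torch.cat([vis_tokens, txt_tokens], dim=1)  # (B, N_vis + N_txt, D)
        out, _ = self.attn(tokens, tokens, tokens)
        tokens = tokens + out
        return tokens[:, :n_vis], tokens[:, n_vis:]  # updated visual and text tokens


if __name__ == "__main__":
    B, N_vis, N_txt, D = 2, 256, 77, 512  # toy sizes, not taken from the paper
    vis, txt = torch.randn(B, N_vis, D), torch.randn(B, N_txt, D)
    v, t = JointMultiModalAttentionBlock(D)(vis, txt)
    print(v.shape, t.shape)  # torch.Size([2, 256, 512]) torch.Size([2, 77, 512])

In the joint block both token streams are updated, which is one way the "simple cross attention" limitation described above can be relaxed; MVDiT's actual blocks, including its spatiotemporal attention and resampling components, are not reproduced here.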
Problem

Research questions and friction points this paper is trying to address.

Lack of a precise, open-source, high-quality dataset for text-to-video generation.
Underutilization of textual information in existing T2V methods.
Need for precise text-video pairs to advance T2V research.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the OpenVid-1M dataset of over 1 million text-video pairs
Proposes the Multi-modal Video Diffusion Transformer (MVDiT)
Creates OpenVidHD-0.4M, a 433K-video 1080p subset, for high-definition generation
🔎 Similar Papers
No similar papers found.
Kepan Nan
Nanjing University
Computer Vision, Video Generation
Rui Xie
Nanjing University
Penghao Zhou
ByteDance
Tiehan Fan
Nanjing University
AIGC, MultiModal Learning
Zhenheng Yang
TikTok
Computer Vision, Machine Learning, Deep Learning
Zhijie Chen
ByteDance
Xiang Li
Nankai University
Jian Yang
Nanjing University
Ying Tai
Nanjing University