🤖 AI Summary
Existing open-source text-to-video models are constrained by proprietary training data, while public datasets (e.g., Koala-36M) rely solely on algorithmic filtering. The resulting visual fidelity and spatiotemporal coherence are too low for high-quality fine-tuning. To address this, we introduce the first high-quality UGC video dataset specifically designed for fine-tuning video generation models. It is built on a human-centric curation paradigm that evaluates both aesthetic quality and temporal consistency, overcoming the inherent limitations of purely algorithmic filtering. Our pipeline integrates shot boundary detection, OCR, border detection, motion-aware filtering, and bilingual captioning to produce 200K high-fidelity short videos paired with bilingual captions. Empirical evaluation demonstrates substantial improvements in fine-tuning performance across leading text-to-video models. The dataset is publicly released and actively expanding.
📝 Abstract
The recent surge in open-source text-to-video generation models has significantly energized the research community, yet their dependence on proprietary training datasets remains a key constraint. Existing open datasets such as Koala-36M apply algorithmic filtering to web-scraped videos from early platforms, but they still lack the quality required for fine-tuning advanced video generation models. We present Tiger200K, a manually curated video dataset of high visual quality sourced from User-Generated Content (UGC) platforms. By prioritizing visual fidelity and aesthetic quality, Tiger200K underscores the critical role of human expertise in data curation and provides high-quality, temporally consistent video-text pairs for fine-tuning and optimizing video generation architectures. These pairs are produced by a simple but effective pipeline comprising shot boundary detection, OCR, border detection, motion filtering, and fine-grained bilingual captioning. The dataset will undergo ongoing expansion and be released as an open-source initiative to advance research and applications in video generation models. Project page: https://tinytigerpan.github.io/tiger200k/
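
As a rough illustration of two of the pipeline stages named above (motion filtering and border detection), the sketch below scores a clip by mean absolute frame difference and checks the first frame for black borders. This is not the Tiger200K implementation: the function names, the thresholds, the grayscale list-of-lists frame format, and the choice to discard bordered clips rather than crop them are all assumptions made for the example.

```python
# Hypothetical sketch: frames are grayscale images stored as lists of rows,
# each row a list of integer intensities in [0, 255].

def motion_score(frames):
    """Mean absolute per-pixel difference between consecutive frames."""
    if len(frames) < 2:
        return 0.0
    total, count = 0, 0
    for prev, curr in zip(frames, frames[1:]):
        for row_prev, row_curr in zip(prev, curr):
            for a, b in zip(row_prev, row_curr):
                total += abs(a - b)
                count += 1
    return total / count


def content_box(frame, dark_threshold=16):
    """Bounds (top, bottom, left, right) of the non-dark region.

    bottom and right are exclusive. Returns None if the whole frame is dark.
    """
    height, width = len(frame), len(frame[0])
    rows = [y for y in range(height) if max(frame[y]) > dark_threshold]
    cols = [
        x for x in range(width)
        if max(frame[y][x] for y in range(height)) > dark_threshold
    ]
    if not rows or not cols:
        return None
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1


def keep_clip(frames, min_motion=1.0, max_motion=40.0, dark_threshold=16):
    """Keep a clip with moderate motion and no black border (assumed policy)."""
    score = motion_score(frames)
    if not (min_motion <= score <= max_motion):
        return False  # near-static clip or excessively jittery clip
    height, width = len(frames[0]), len(frames[0][0])
    return content_box(frames[0], dark_threshold) == (0, height, 0, width)


# Usage: a clip whose brightness changes by 10 per frame passes the filter.
clip = [[[100 + 10 * t] * 4 for _ in range(4)] for t in range(3)]
print(motion_score(clip), keep_clip(clip))
```

In practice the frames would come from a video decoder, and the thresholds would need tuning against manually reviewed clips, in line with the human-centric curation the abstract describes.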