🤖 AI Summary
This study addresses sampling bias in TikTok data collection, which arises from API restrictions and the opacity of the platform's recommendation algorithm. To overcome these challenges, the authors propose a reverse-engineering methodology that integrates HTTP traffic analysis, ID-space inference, and temporal feature modeling, enabling platform-wide, high-coverage, and reproducible sampling. Using distributed crawler orchestration and synchronized video and comment acquisition, the approach collects over 99% of the posts from one full hour, along with over 99% of the posts from a one-minute slice of each hour across a full 24-hour day. The result is a temporally resolved dataset of videos, metadata, and comments spanning an entire day. Based on this dataset, the authors estimate that about 117 million posts were published on TikTok worldwide on the sampled day, substantially higher than prior empirical benchmarks. The work provides a methodological foundation for large-scale empirical research on social media platforms.
📝 Abstract
TikTok is now a massive platform with a deep impact on global events. Yet despite the preliminary studies done on it, fundamental characteristics of the platform remain difficult to determine. We develop a method to extract a representative sample from a specific time range on TikTok, and use it to collect >99% of posts from a full hour on the platform, alongside a dataset of >99% of posts from a single minute of each hour of a day. Through this, we obtain post metadata, video media data, and comments from a nearly complete slice of TikTok. Using this dataset, we report critical statistics of the platform, notably estimating that a total of 117 million posts were produced on TikTok on the day we studied.
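A daily total of this kind can be extrapolated from one-minute samples taken in each hour. The following is a minimal sketch of that arithmetic, not the paper's estimator: it assumes each sampled minute is representative of its hour, and the function name `estimate_daily_posts`, the `coverage` parameter, and the example counts are all hypothetical.

```python
from typing import Sequence

MINUTES_PER_HOUR = 60
HOURS_PER_DAY = 24


def estimate_daily_posts(minute_counts: Sequence[int], coverage: float = 0.99) -> float:
    """Scale one sampled minute per hour up to a full-day post estimate.

    minute_counts: posts observed in the sampled minute of each hour (24 values).
    coverage: assumed fraction of posts the sample captures (hypothetical parameter).
    """
    if len(minute_counts) != HOURS_PER_DAY:
        raise ValueError("expected one sampled minute for each of the 24 hours")
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in the interval (0, 1]")
    # Assumption: each sampled minute stands in for all 60 minutes of its hour.
    observed = sum(count * MINUTES_PER_HOUR for count in minute_counts)
    # Correct for the share of posts the sample is assumed to miss.
    return observed / coverage


# Toy counts for illustration only; these are not figures from the dataset.
toy_counts = [1_000] * HOURS_PER_DAY
print(estimate_daily_posts(toy_counts, coverage=1.0))
```

Sampling a minute from every hour, rather than scaling a single hour by 24, lets the estimate reflect variation in posting volume over the course of the day.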