AI Summary
This study addresses a gap in existing research: the lack of transparent, reproducible, large-scale longitudinal data on YouTube creators and their audience sizes. Using nearly two decades of YouTube page captures from the Internet Archive, the authors construct TubeCensus, a longitudinal dataset of creators and subscriber counts that does not depend on the YouTube API. The dataset is built through historical webpage parsing, data cleaning, and channel identifier mapping. Validated against prior estimates of YouTube's size, it includes creators responsible for at least 30-36% of all YouTube content and provides good coverage of prominent channels. The resource is distributed as a pip package, lowering barriers to data access and enabling exploratory analyses of channel content and the mechanisms associated with channel growth without reliance on the official API.
Abstract
YouTube is central to contemporary mass media. However, the official YouTube API does not provide access to the full set of creators or creator metadata on the platform. This lack of basic visibility into the YouTube ecosystem hinders understanding of the platform's creator economy. Researchers currently have no easy, transparent, or replicable way to construct large-scale datasets of YouTube creators and their audiences over time. This makes it challenging to study vital social questions, such as how changes to the YouTube recommendation algorithm shape creator incentives and by extension the mass media on the platform. We address this gap with TubeCensus, a large-scale longitudinal dataset of YouTube creators and subscriber counts, constructed by collecting, linking, and organizing nearly two decades of YouTube page captures from the Internet Archive. This approach is transparent and replicable and does not require interaction with the YouTube API, whose output can change over time. We validate the coverage of TubeCensus against prior estimates of YouTube's size and find that our resource includes creators responsible for at least 30-36% of all YouTube content. We also find that TubeCensus provides good coverage of prominent creators. To support future research, we hide the substantial complexities of the YouTube identifier system and Internet Archive capture system by distributing our dataset via an easy-to-use pip package. Finally, we use our resource to complete basic exploratory analysis of YouTube channel content and the mechanisms associated with YouTube channel growth.
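To make the capture-collection idea concrete, the following is a minimal sketch of how one might list Internet Archive captures of a single YouTube page through the public Wayback Machine CDX endpoint. It is not the TubeCensus pipeline or the API of its pip package: the function names (build_cdx_query, parse_cdx_json, snapshot_url) and the example channel URL are hypothetical, and the sketch only builds the query and parses a response, leaving the HTTP request itself to the reader.

```python
import json
from urllib.parse import urlencode

# Public Wayback Machine CDX endpoint, which lists captures of a URL.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(page_url, start_year, end_year):
    """Build a CDX query URL for captures of page_url (hypothetical helper)."""
    params = {
        "url": page_url,
        "output": "json",            # JSON array of rows, header row first
        "from": str(start_year),     # timestamp prefix, here a year
        "to": str(end_year),
        "filter": "statuscode:200",  # keep successful captures only
        "collapse": "digest",        # drop adjacent captures with identical content
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_json(payload):
    """Turn a CDX JSON response body into a list of dicts keyed by field name."""
    rows = json.loads(payload)
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def snapshot_url(capture):
    """Return the Wayback Machine URL at which a given capture can be replayed."""
    return f"https://web.archive.org/web/{capture['timestamp']}/{capture['original']}"
```

Fetching the URL returned by build_cdx_query (for example with urllib.request.urlopen) and passing the response body to parse_cdx_json yields one record per capture. Each record's timestamp and original fields identify a snapshot that could then be parsed for a subscriber count. Linking the many URL forms under which one channel appears is a separate problem that this sketch does not attempt.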