TubeCensus: A Transparent, Replicable, and Large-Scale Census of YouTube Channels and their Subscriber Counts Over Time

📅 2026-05-07
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This study addresses a gap in existing research: the lack of transparent, reproducible, and large-scale longitudinal data on YouTube creators and their audience sizes. Using nearly two decades of YouTube page captures from the Internet Archive, the authors construct an API-independent longitudinal dataset of creators and subscriber counts. Validated against prior estimates of YouTube's size, the dataset includes creators responsible for at least 30–36% of all YouTube content and provides good coverage of prominent channels. The dataset is built by collecting, linking, and organizing archived pages, which involves parsing historical webpages, cleaning the data, and mapping channel identifiers. It is distributed as a pip package that hides the complexities of the YouTube identifier system and the Internet Archive capture system, and it supports exploratory analyses of channel content and the mechanisms associated with channel growth without reliance on the official API.
📝 Abstract
YouTube is central to contemporary mass media. However, the official YouTube API does not provide access to the full set of creators or creator metadata on the platform. This lack of basic visibility into the YouTube ecosystem hinders understanding of the platform's creator economy. Researchers currently have no easy, transparent, or replicable way to construct large-scale datasets of YouTube creators and their audiences over time. This makes it challenging to study vital social questions, such as how changes to the YouTube recommendation algorithm shape creator incentives and by extension the mass media on the platform. We address this gap with TubeCensus, a large-scale longitudinal dataset of YouTube creators and subscriber counts, constructed by collecting, linking, and organizing nearly two decades of YouTube page captures from the Internet Archive. This approach is transparent and replicable and does not require interaction with the YouTube API, whose output can change over time. We validate the coverage of TubeCensus against prior estimates of YouTube's size and find that our resource includes creators responsible for at least 30-36% of all YouTube content. We also find that TubeCensus provides good coverage of prominent creators. To support future research, we hide the substantial complexities of the YouTube identifier system and Internet Archive capture system by distributing our dataset via an easy-to-use pip package. Finally, we use our resource to complete basic exploratory analysis of YouTube channel content and the mechanisms associated with YouTube channel growth.
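As a rough illustration of turning archived page captures into a subscriber-count time series, the sketch below extracts a count from the text of each capture and orders the results by capture date. It is not the TubeCensus implementation: the function names, the `(capture_date, page_text)` input shape, and the assumed "1.2M subscribers" text format are all hypothetical, and real YouTube page layouts have varied over the years.

```python
import re
from datetime import date

# Multipliers for abbreviated suffixes (assumed format, e.g. "4.5K", "1.2M").
_SUFFIX = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}

# Matches strings such as "1.2M subscribers", "12,345 subscribers", "1 subscriber".
_PATTERN = re.compile(
    r"(\d+(?:,\d{3})*(?:\.\d+)?)\s*([KMB]?)\s+subscribers?",
    re.IGNORECASE,
)


def parse_subscriber_count(text):
    """Return the first subscriber count found in text as an int, or None."""
    match = _PATTERN.search(text)
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    suffix = match.group(2).upper()
    return round(number * _SUFFIX[suffix])


def build_series(captures):
    """Turn (capture_date, page_text) pairs into a date-sorted list of
    (capture_date, subscriber_count), skipping captures with no count."""
    series = []
    for captured_on, text in captures:
        count = parse_subscriber_count(text)
        if count is not None:
            series.append((captured_on, count))
    return sorted(series)


# Hypothetical captures of one channel page at three points in time.
example = build_series([
    (date(2015, 1, 1), "4.5K subscribers"),
    (date(2010, 6, 1), "300 subscribers"),
    (date(2012, 1, 1), "page unavailable"),
])
```

Captures with no recognizable count are dropped rather than guessed, so gaps in the archive remain visible as gaps in the series.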
Problem

Research questions and friction points this paper is trying to address.

YouTube creator economy
platform transparency
longitudinal data
algorithmic impact
media ecosystem
Innovation

Methods, ideas, or system contributions that make the work stand out.

YouTube census
longitudinal dataset
Internet Archive
transparent methodology
creator economy