🤖 AI Summary
This study addresses the longstanding absence of large-scale, high-quality digital datasets of U.S. presidential campaign television advertisements. We propose the first end-to-end, parallelized AI analytics pipeline, which integrates automatic speech recognition (ASR), large language model (LLM)-driven video understanding and abstractive summarization, distributed preprocessing, and human-in-the-loop quality verification, to automatically construct a digital dataset of 9,707 ads spanning 1952–2012. To date, this is the most comprehensive and temporally extensive resource of its kind, covering seven decades of presidential elections and enabling longitudinal analysis of how campaign issues evolve. Human evaluation indicates that the generated transcripts and summaries match manual annotations in accuracy and informativeness. The entire pipeline, including code, models, and data, is fully open-sourced, establishing both a benchmark dataset and a reproducible methodological framework for political communication research and fine-grained video semantic analysis.
📝 Abstract
This paper introduces the largest and most comprehensive dataset of US presidential campaign television advertisements available in digital format. The dataset also includes machine-searchable transcripts and high-quality summaries designed to facilitate a variety of academic research. There has long been great interest in collecting and analyzing US presidential campaign advertisements, but the need for manual procurement and annotation has led many researchers to rely on smaller subsets. We design a large-scale, parallelized, AI-based analysis pipeline that automates the laborious process of preparing, transcribing, and summarizing videos. We then apply this methodology to the 9,707 presidential ads from the Julian P. Kanter Political Commercial Archive. We conduct extensive human evaluations to show that these transcripts and summaries match the quality of manually generated alternatives. We illustrate the value of this data by including an application that tracks the genesis and evolution of current focal issue areas over seven decades of presidential elections. Our analysis pipeline and codebase also show how to use LLM-based tools to obtain high-quality summaries for other video datasets.
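As a rough illustration of the transcribe-then-summarize flow described above, the following minimal sketch runs the two stages over a batch of videos in parallel. It is not the authors' implementation: `AdRecord`, `process_ad`, `run_pipeline`, and the placeholder `fake_transcribe` and `fake_summarize` functions are hypothetical names, and a real pipeline would substitute an ASR model and an LLM call for the placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class AdRecord:
    """One processed advertisement (hypothetical record layout)."""
    video_id: str
    transcript: str
    summary: str


def process_ad(
    video_id: str,
    transcribe: Callable[[str], str],
    summarize: Callable[[str], str],
) -> AdRecord:
    # Stage 1: speech-to-text; stage 2: summarize the resulting transcript.
    transcript = transcribe(video_id)
    summary = summarize(transcript)
    return AdRecord(video_id, transcript, summary)


def run_pipeline(
    video_ids: Sequence[str],
    transcribe: Callable[[str], str],
    summarize: Callable[[str], str],
    max_workers: int = 4,
) -> List[AdRecord]:
    # Each video is independent, so the batch can be processed concurrently.
    # pool.map returns results in the same order as the input ids.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(
            pool.map(lambda v: process_ad(v, transcribe, summarize), video_ids)
        )


# Placeholder stages (assumptions): stand-ins for an ASR model and an LLM.
def fake_transcribe(video_id: str) -> str:
    return f"transcript of {video_id}"


def fake_summarize(transcript: str) -> str:
    return transcript.upper()[:40]


records = run_pipeline(["ad_001", "ad_002"], fake_transcribe, fake_summarize)
```

Passing the two stages in as callables keeps the orchestration separate from any particular model, which is one way such a pipeline could be reused on other video datasets.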