AI Summary
This study addresses the scarcity of multimodal political communication data in authoritarian contexts, which has hindered related research. To bridge this gap, the authors construct a cross-modal, multilingual dataset of high-level Russian government speeches, encompassing parallel Russian-English texts, accompanying images, rich metadata, and expert-validated thematic tags. Unique identifiers link images to speeches and align the Russian and English versions of each text. Topical annotations are generated with Transformer-based multimodal topic modeling and then refined by a domain expert. The resulting resource spans several decades and supports temporal and spatial analyses of authoritarian political communication, as well as large language model applications in political science.
Abstract
This paper introduces a dataset of interlinked multimodal political communications from the Russian government, addressing persistent deficiencies in the availability of social text- and image-based data for authoritarian political contexts. The dataset comprises two large corpora of official speeches delivered by senior actors within the Kremlin and the Russian Ministry of Foreign Affairs over multiple decades. For each speech, we provide Russian- and English-language texts, associated images and captions where available, and harmonized metadata such as dates, speakers, (geo)locations, and official government content tags. Unique identifiers link images to speeches and align Russian and English versions of the same communication texts. We further augment these linked datasets with validated topical annotations for both speech texts and speech images, which are generated via transformer-based multimodal topic modeling and refined by a Russian politics expert. The resulting data resources support multimodal, multilingual, temporal, and/or spatial analyses of (authoritarian) political communication and offer a valuable testbed for social science research and large language model (LLM) applications in political domains.