Linked Multi-Model Data on Russian Domestic and Foreign Policy Speeches

2026-05-15

AI Summary
This study addresses the scarcity of multimodal political communication data in authoritarian contexts, which has hindered related research. To bridge this gap, the authors construct the first cross-modal, multilingual dataset of high-level Russian government speeches, encompassing parallel Russian–English texts, accompanying images, rich metadata, and expert-validated thematic tags. A unique identifier enables precise alignment across textual content, visual elements, and language versions. Innovatively integrating Transformer-based multimodal topic modeling with domain expert knowledge, the project achieves robust semantic alignment and rigorous data validation. The resulting resource spans several decades and offers a high-quality, scalable foundation for spatiotemporal analyses of authoritarian political communication and for advancing large language model applications in political science.
๐Ÿ“ Abstract
This paper introduces a dataset of interlinked multimodal political communications from the Russian government, addressing persistent deficiencies in the availability of social text- and image-based data for authoritarian politics contexts. The dataset comprises two large corpora of official speeches delivered by senior actors within the Kremlin and the Russian Ministry of Foreign Affairs over multiple decades. For each speech, we provide Russian- and English-language texts, associated images and captions where available, and harmonized metadata including (e.g.) dates, speakers, (geo)locations, and official government content tags. Unique identifiers link images to speeches and align Russian and English versions of the same communication texts. We further augment these linked datasets with validated topical annotations for both speech texts and speech images, which are generated via transformer-based multimodal topic modeling and refined by a Russian politics expert. The resulting data resources support multimodal, multilingual, temporal, and/or spatial analyses of (authoritarian) political communication and offer a valuable testbed for social science research and large language model (LLM) applications in political domains.
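The identifier-based linking described in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that speech rows and image rows share a `speech_id` key and that speech rows carry a `lang` code. The field names, the `link_records` helper, and the toy rows are all hypothetical and do not reflect the dataset's actual schema or file layout.

```python
from collections import defaultdict


def link_records(speeches, images):
    """Group language versions and images under a shared speech identifier.

    Hypothetical helper: field names are assumptions, not the dataset's schema.
    """
    linked = defaultdict(lambda: {"texts": {}, "images": []})
    for row in speeches:
        # One entry per language version of the same speech.
        linked[row["speech_id"]]["texts"][row["lang"]] = row["text"]
    for row in images:
        # Images attach to a speech through the same identifier.
        linked[row["speech_id"]]["images"].append(row["caption"])
    return dict(linked)


# Toy rows; every field name and value here is a placeholder.
speeches = [
    {"speech_id": "s001", "lang": "ru", "text": "(Russian text)"},
    {"speech_id": "s001", "lang": "en", "text": "(English text)"},
    {"speech_id": "s002", "lang": "ru", "text": "(Russian text)"},
]
images = [{"speech_id": "s001", "caption": "(image caption)"}]

linked = link_records(speeches, images)
print(sorted(linked["s001"]["texts"]))  # language versions found for s001
print(len(linked["s002"]["images"]))    # number of images linked to s002
```

Keying everything on one identifier keeps a speech with a missing translation or no images in the result, so gaps in coverage stay visible instead of being silently dropped.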
Problem

Research questions and friction points this paper is trying to address.

authoritarian politics
multimodal data
political communication
Russian government speeches
social text and image data
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal linking
authoritarian political communication
transformer-based topic modeling
multilingual aligned dataset
validated topical annotation