Temporally Extending Existing Web Archive Collections for Longitudinal Analysis

📅 2025-05-30
🤖 AI Summary
Existing Environmental Data and Governance Initiative (EDGI) archives cover only 2016–2020 and so miss most of the Obama administration (2009–2017), which prevents tracing the origin of website terms deleted during the Trump administration (2017–2021). Method: We propose a longitudinal archival integration framework that combines heterogeneous web snapshots from Save Page Now, Archive-It, and other sources via URL alignment, page persistence identification, and cross-version content comparison. Contribution/Results: Our method extends the EDGI collection of U.S. federal environmental websites back to January 2008, yielding a dataset that spans 2008–2020 and supports policy analysis across administrations. In this dataset, 81% of pages changed between 2008 and 2020, and on 87% of the pages with terms deleted under the Trump administration, the deleted terms had been added during the Obama administration. The work contributes both the extended dataset and a reusable methodology for temporally extending web archive collections.

📝 Abstract
The Environmental Data and Governance Initiative (EDGI) regularly crawled US federal environmental websites between 2016 and 2020 to capture changes between two presidential administrations. However, because it does not include the previous administration ending in 2008, the collection is unsuitable for answering our research question: "Were the website terms deleted by the Trump administration (2017–2021) added by the Obama administration (2009–2017)?" Thus, like many researchers using the Wayback Machine's holdings for historical analysis, we do not have access to a complete collection suiting our needs. To answer our research question, we must extend the EDGI collection back to January 2008. This means not only going further back in time for pages already in the collection, but also discovering relevant pages that persisted through 2020 yet were never included in the EDGI collection. We pieced together artifacts that various organizations collected for their own purposes and through various means (Save Page Now, Archive-It, and more) to curate a dataset sufficient for our purposes. In this paper, we contribute a methodology for temporally extending existing web archive collections to enable longitudinal analysis, along with a dataset extended using this methodology. We use the new dataset to answer our research question. We find that 81 percent of the pages in the dataset changed between 2008 and 2020, and that on 87 percent of the pages with terms deleted by the Trump administration, the deleted terms had been added during the Obama administration.
Problem

Research questions and friction points this paper is trying to address.

Extend web archive collections to cover missing presidential administration periods
Analyze changes in US federal environmental websites across administrations
Determine whether website terms deleted under the Trump administration had been added under the Obama administration
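
To make the last question concrete, here is a minimal Python sketch of how one page could be checked. It is an illustration under stated assumptions, not the paper's implementation: it assumes three text snapshots per page (one from before the Obama administration, one from its end, one from the end of the Trump administration), a hypothetical list of tracked terms, and a simple rule in which a term counts as "added" or "deleted" when its occurrence count rises or falls between snapshots. The function names `term_counts` and `classify_terms` are made up for this example.

```python
import re


def term_counts(text, terms):
    """Count case-insensitive whole-phrase occurrences of each term in text."""
    lowered = text.lower()
    return {
        term: len(re.findall(r"\b" + re.escape(term.lower()) + r"\b", lowered))
        for term in terms
    }


def classify_terms(before_obama, end_obama, end_trump, terms):
    """Compare three snapshots of one page (assumed snapshot choice).

    Returns the terms whose count dropped in the later period, and the
    subset of those whose count had risen in the earlier period.
    """
    start = term_counts(before_obama, terms)
    middle = term_counts(end_obama, terms)
    end = term_counts(end_trump, terms)

    # Assumption: a drop in count between snapshots is treated as a deletion.
    deleted = {t for t in terms if end[t] < middle[t]}
    # Assumption: a rise in count between snapshots is treated as an addition.
    added = {t for t in terms if middle[t] > start[t]}

    return {"deleted": deleted, "added_then_deleted": deleted & added}


if __name__ == "__main__":
    tracked = ["climate change", "clean energy"]  # hypothetical term list
    result = classify_terms(
        "Energy policy overview.",
        "Climate change and clean energy policy.",
        "Clean energy policy.",
        tracked,
    )
    print(result)
```

A page would count toward the 87 percent style of statistic only if its `added_then_deleted` set is non-empty; how the paper itself defines and aggregates this is not specified in the abstract.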
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extend web archive collections temporally
Combine artifacts from multiple sources
Methodology for longitudinal analysis
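
The idea of combining artifacts from multiple sources can be sketched in a few lines of Python. This is a rough illustration, not the authors' pipeline: it assumes each capture is a `(source, url, timestamp)` tuple with a 14-digit Wayback-style timestamp (`YYYYMMDDhhmmss`), that URLs are absolute, and that a page "persists" if it has at least one capture in the first year and one in the last year of the study window. The helper names `normalize_url` and `persistent_pages` and the source labels are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize_url(url):
    """Build a comparison key: drop scheme, leading 'www.', fragment, trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.hostname or ""  # hostname is already lowercased by urlsplit
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    query = "?" + parts.query if parts.query else ""
    return host + path + query


def persistent_pages(captures, start_year, end_year):
    """Return URL keys captured in both start_year and end_year, across all sources."""
    years_by_key = defaultdict(set)
    for _source, url, timestamp in captures:
        # The first four digits of the timestamp are the capture year.
        years_by_key[normalize_url(url)].add(int(timestamp[:4]))
    return {
        key
        for key, years in years_by_key.items()
        if start_year in years and end_year in years
    }


if __name__ == "__main__":
    captures = [
        ("save-page-now", "https://www.epa.gov/climatechange/", "20080115120000"),
        ("archive-it", "http://epa.gov/climatechange", "20201201080000"),
        ("edgi", "https://www.epa.gov/newpage", "20170301000000"),
    ]
    print(persistent_pages(captures, 2008, 2020))
```

Normalizing before grouping is what lets captures of the same page from different collections line up even when they differ in scheme, `www.` prefix, or trailing slash; real collections would likely need more careful canonicalization (redirects, moved pages) than this sketch attempts.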