🤖 AI Summary
This paper identifies five technical barriers to archiving and replaying web advertisements, based on a dataset of 279 archived ads: (1) exclusion of ads and ad-related URLs by Internet Archive's Save Page Now service; (2) Brozzler's incompatibility with Chrome, which prevented ads from being archived; (3) replay failures caused by Google's and Amazon's ad scripts generating URLs with different random values during crawling and replay; (4) requests for a non-existent URL when Flashtalking ads are loaded outside of ad iframes; and (5) a Chromium bug that stops service workers from accessing resources inside iframes whose src attribute is "about:blank". The paper proposes two mitigations: updating replay systems' fuzzy matching approach to handle the Google, Amazon, and Flashtalking URL mismatches, and replacing the "about:blank" src value with a blob URL before the ad loads. Resolving these problems should improve the replay of ads and of other dynamically loaded embedded web resources that use random values or "about:blank" iframes.
📝 Abstract
Although web advertisements represent an inimitable part of digital cultural heritage, serious archiving and replay challenges persist. To explore these challenges, we created a dataset of 279 archived ads. We encountered five problems in archiving and replaying them. First, prior to August 2023, Internet Archive's Save Page Now service excluded not only ads from well-known ad services, but also URLs with ad-related file and directory names. After August 2023, Save Page Now still blocked the archiving of ads loaded on a web page, but it permitted the archiving of an ad's resources if the user directly archived the URL(s) associated with the ad. Second, Brozzler's incompatibility with Chrome prevented ads from being archived. Third, during crawling and replay sessions, Google's and Amazon's ad scripts generated URLs with different random values, which precluded the replay of archived ads. Updating replay systems' fuzzy matching approach should enable the replay of these ads. Fourth, when Flashtalking web page ads were loaded outside of ad iframes, the ad script requested a non-existent URL, which prevented the replay of ad resources. As with Google and Amazon ads, updating replay systems' fuzzy matching approach should enable Flashtalking ads' replay. Finally, successful replay of ads loaded in iframes with a src attribute of "about:blank" depended upon a given browser's service worker implementation. A Chromium bug stopped service workers from accessing resources inside this type of iframe, which in turn prevented replay. Replacing the "about:blank" value of the iframe's src attribute with a blob URL before an ad was loaded solved this problem. Resolving these replay problems will improve the replay of ads and other dynamically loaded embedded web resources that use random values or "about:blank" iframes.
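
To illustrate the random-value problem described above, the following is a minimal sketch of one possible fuzzy matching strategy. It is not the method used by any particular replay system. It assumes that the random values appear in the query string, and the function names (`fuzzy_key`, `fuzzy_lookup`) and the example URLs are hypothetical. When the exact URL requested at replay time is not in the archive, the sketch falls back to the archived URL with the same host and path that shares the most query parameters.

```python
from urllib.parse import urlsplit, parse_qsl


def fuzzy_key(url):
    """Reduce a URL to (host, path), ignoring the query string entirely."""
    parts = urlsplit(url)
    return (parts.netloc.lower(), parts.path)


def query_pairs(url):
    """Return the set of (name, value) pairs in a URL's query string."""
    return set(parse_qsl(urlsplit(url).query, keep_blank_values=True))


def fuzzy_lookup(requested, archived_urls):
    """Return the archived URL that best matches `requested`, or None.

    An exact match wins. Otherwise, candidates must share host and path,
    and the one with the most identical query parameters is chosen, so
    that a parameter holding a fresh random value does not block replay.
    """
    if requested in archived_urls:
        return requested

    wanted_key = fuzzy_key(requested)
    wanted_pairs = query_pairs(requested)

    best, best_score = None, -1
    for candidate in archived_urls:
        if fuzzy_key(candidate) != wanted_key:
            continue  # different host or path: not the same resource
        score = len(wanted_pairs & query_pairs(candidate))
        if score > best_score:
            best, best_score = candidate, score
    return best


if __name__ == "__main__":
    # Hypothetical archive contents; "rnd" stands in for a random value.
    archive = [
        "https://ads.example.com/ad.js?slot=top&rnd=111",
        "https://ads.example.com/ad.js?slot=side&rnd=222",
    ]
    print(fuzzy_lookup("https://ads.example.com/ad.js?slot=top&rnd=999", archive))
```

Scoring by shared parameters avoids hard-coding the names of the random parameters, at the cost of possibly returning a poor match when several archived captures differ only in parameters that matter; a production rule set would likely need per-service tuning.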