🤖 AI Summary
This paper addresses the fairness and timeliness challenges of web cache refresh under bandwidth constraints, proposing a robust crawl-scheduling framework that leverages noisy change signals (e.g., sitemaps, CDN notifications). The core difficulty is signal unreliability (false positives and missed changes) combined with heterogeneous noise levels across pages, which prevents conventional approaches from achieving fairness and efficiency simultaneously. Methodologically, the paper introduces the first optimal robust fusion of heterogeneous noisy side information, integrated with Poisson-process-based stochastic scheduling, distributed online rate control, and dynamic bandwidth adaptation. The system supports decentralized deployment while keeping total bandwidth consumption constant. Experiments demonstrate a 37% improvement in bandwidth utilization over signal-agnostic baselines and state-of-the-art heuristics, elimination of spikes in total bandwidth usage, and significant gains in cross-page scheduling fairness and temporal consistency.
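To give intuition for weighing such unreliable signals, the sketch below computes the probability that a received signal corresponds to a real change, under the illustrative assumption that true alerts and false positives arrive as independent Poisson streams. The function name and all rates are hypothetical; this is a textbook Bayes-style calculation, not the paper's actual fusion rule.

```python
def true_alert_probability(change_rate: float, miss_prob: float,
                           fp_rate: float) -> float:
    """Probability that a received change signal reflects a real change.

    Illustrative model: real changes occur as a Poisson process with rate
    `change_rate`, each emitting a signal with probability 1 - `miss_prob`
    (so true alerts form a thinned Poisson process), while false positives
    arrive as an independent Poisson process with rate `fp_rate`.
    """
    true_rate = (1.0 - miss_prob) * change_rate
    return true_rate / (true_rate + fp_rate)

# A page that changes 0.5 times/hour, with 30% of changes missed and
# 0.2 spurious signals/hour: true alerts arrive at rate 0.35, so a
# received signal is genuine with probability 0.35 / 0.55 = 7/11.
p = true_alert_probability(change_rate=0.5, miss_prob=0.3, fp_rate=0.2)
```

A page with a low false-positive rate yields a probability near 1, so its signals can largely drive refreshes; a very noisy page pushes the scheduler back toward a signal-agnostic policy, which is one way to read the fairness requirement across heterogeneous noise levels.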
📝 Abstract
Web refresh crawling is the problem of keeping a cache of web pages fresh, that is, having the most recent copy of a page available when it is requested, given the limited bandwidth available to the crawler. Under the assumption that the change and request events of each web page follow independent Poisson processes, the optimal scheduling policy was derived by Azar et al. (2018). In this paper, we study an extension of this problem in which side information indicating content changes, such as various types of web pings (for example, signals from sitemaps or content delivery networks), is available. Incorporating such side information into the crawling policy is challenging because (i) the signals can be noisy, with false positive events and missed change events; and (ii) the crawler should achieve fair performance over web pages regardless of the quality of the side information, which may differ from page to page. We propose a scalable crawling algorithm which (i) uses the noisy side information in an optimal way under mild assumptions; (ii) can be deployed without heavy centralized computation; (iii) crawls web pages at a constant total rate, without spikes in the total bandwidth usage over any time interval, and automatically adapts to the new optimal solution when the total bandwidth changes, again without centralized computation. Experiments clearly demonstrate the versatility of our approach.
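To make the setting concrete, here is a minimal, self-contained simulation of a single page under hypothetical parameters: content changes follow a Poisson process, a noisy signal misses some changes and also fires spuriously, and the crawler refreshes on every signal plus at a baseline Poisson rate. All names and rates are illustrative and not taken from the paper, and signals are assumed to be delivered and acted on instantly, which is a simplification.

```python
import random

def simulate_page(horizon=1000.0, change_rate=0.5, miss_prob=0.3,
                  fp_rate=0.2, base_refresh_rate=0.1, seed=0):
    """Simulate one page; return (fraction of time the cache is fresh,
    number of crawls performed)."""
    rng = random.Random(seed)

    def poisson_times(rate):
        # Arrival times of a homogeneous Poisson process on [0, horizon).
        times, t = [], 0.0
        if rate <= 0.0:
            return times
        while True:
            t += rng.expovariate(rate)
            if t >= horizon:
                return times
            times.append(t)

    changes = poisson_times(change_rate)
    true_alerts = [t for t in changes if rng.random() > miss_prob]
    signals = sorted(true_alerts + poisson_times(fp_rate))
    # The crawler fetches on every signal and at a baseline Poisson rate.
    crawls = sorted(signals + poisson_times(base_refresh_rate))

    # Sweep the merged timeline: the cache is stale from the first
    # un-crawled change until the next crawl.
    events = sorted([(t, 0) for t in changes] + [(t, 1) for t in crawls])
    stale_since, stale_time = None, 0.0
    for t, kind in events:
        if kind == 0 and stale_since is None:        # change: goes stale
            stale_since = t
        elif kind == 1 and stale_since is not None:  # crawl: fresh again
            stale_time += t - stale_since
            stale_since = None
    if stale_since is not None:
        stale_time += horizon - stale_since
    return 1.0 - stale_time / horizon, len(crawls)
```

Running this with the signal disabled (`miss_prob=1.0, fp_rate=0.0`) versus enabled shows the freshness gap a signal-aware policy can exploit; the paper's contribution is achieving such gains optimally and fairly while keeping the total crawl rate, and hence bandwidth, constant rather than letting signals inflate it.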