TimeWarp: Evaluating Web Agents by Revisiting the Past

📅 2026-03-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing web agents are vulnerable to changes in web interfaces over time, and current benchmarks and training methods do not account for UI variation across eras. To address this gap, this work introduces TimeWarp, a benchmark that uses containerized environments to reconstruct three websites, each in six UI versions spanning different eras of the internet. It also proposes TimeTraj, an algorithm that uses plan distillation to collect trajectories across multiple UI versions, in place of conventional behavior cloning on single-version trajectories. Training agents on the resulting teacher rollouts substantially improves cross-UI generalization: success rates rise from 20.4% to 37.7% for Qwen-3 4B and from 0% to 27.0% for Llama-3.1 8B.
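The benchmark layout described above (three environments, six UI versions each) implies an evaluation grid of 18 configurations. A minimal sketch of how per-version success rates could be tallied over such a grid is shown here; the environment names, the result tuple format, and both function names are illustrative assumptions, not part of TimeWarp itself.

```python
from collections import defaultdict
from itertools import product

# Hypothetical placeholders: the summary does not name the three websites.
ENVIRONMENTS = ["env_a", "env_b", "env_c"]
UI_VERSIONS = [1, 2, 3, 4, 5, 6]


def evaluation_grid():
    """Return every (environment, ui_version) pair an agent would be run on."""
    return list(product(ENVIRONMENTS, UI_VERSIONS))


def success_rate_by_version(results):
    """Aggregate task outcomes into a success rate per UI version.

    `results` is an iterable of (environment, ui_version, success) tuples,
    an assumed logging format chosen for this sketch.
    """
    totals = defaultdict(int)
    wins = defaultdict(int)
    for _env, version, success in results:
        totals[version] += 1
        wins[version] += int(success)
    return {version: wins[version] / totals[version] for version in sorted(totals)}
```

Comparing the per-version rates side by side is one way to see whether an agent trained on a single UI version holds up on the others.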

📝 Abstract

The improvement of web agents on current benchmarks raises the question: Do today's agents perform just as well when the web changes? We introduce TimeWarp, a benchmark that emulates the evolving web using containerized environments that vary in UI, design, and layout. TimeWarp consists of three web environments, each with six UI versions spanning different eras of the internet, paired with a set of complex, realistic tasks requiring different forms of web navigation. Our experiments reveal web agents' vulnerability to changes and the limitations of behavior cloning (BC) on single-version trajectories. To address this, we propose TimeTraj, a simple yet effective algorithm that uses plan distillation to collect trajectories across multiple versions. By training agents on teacher rollouts using our BC-variant, we achieve substantial performance gains: $20.4\%\rightarrow37.7\%$ for Qwen-3 4B and $0\%\rightarrow27.0\%$ for Llama-3.1 8B models. We hope our work helps researchers study generalization across web designs and unlocks a new paradigm for collecting plans rather than trajectories, thereby improving the robustness of web agents.
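The abstract does not spell out how TimeTraj works internally, so the following is only a rough sketch of the general idea of collecting plans once and reusing them across UI versions. Everything here is an assumption made for illustration: the `Trajectory` record, the `make_plan` and `rollout` callables, and the choice to keep only successful rollouts.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Trajectory:
    """One teacher rollout, usable as a behavior-cloning example (assumed format)."""
    task: str
    ui_version: int
    plan: str
    actions: List[str]


def collect_multi_version(
    tasks: Sequence[str],
    ui_versions: Sequence[int],
    make_plan: Callable[[str], str],
    rollout: Callable[[str, str, int], Tuple[List[str], bool]],
) -> List[Trajectory]:
    """Collect trajectories for every task on every UI version.

    A single version-agnostic plan is produced per task, then executed on
    each UI version. Only successful rollouts are kept (an assumption of
    this sketch, not a detail confirmed by the abstract).
    """
    dataset: List[Trajectory] = []
    for task in tasks:
        plan = make_plan(task)  # hypothetical teacher call, once per task
        for version in ui_versions:
            actions, success = rollout(task, plan, version)
            if success:
                dataset.append(Trajectory(task, version, plan, actions))
    return dataset
```

The point of the design is that the plan is the reusable artifact: the same plan yields different action sequences on different UI versions, so a student trained on the combined set sees more than one layout per task.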

Problem

Research questions and friction points this paper is trying to address.

web agents

benchmark

temporal generalization

UI evolution

robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

TimeWarp

web agent generalization

plan distillation