R-WoM: Retrieval-Augmented World Model for Computer-Use Agents

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) can act as world models in digital environments, but hallucination and reliance on static training knowledge undermine their reliability for long-horizon state prediction and reward estimation. To probe whether LLMs are fit for world modeling, the paper evaluates three tasks: next-state identification, full-procedure planning alignment, and milestone transition recognition, finding that LLMs capture immediate next states and meaningful transitions but degrade rapidly in full-procedure planning. To address this, the authors propose the Retrieval-augmented World Model (R-WoM), which grounds LLM simulations in factual, up-to-date knowledge retrieved from external tutorials. On the OSWorld and WebArena benchmarks, R-WoM improves over baselines by up to 25.3% and 18.1%, respectively, with the largest gains in longer-horizon simulations.

📝 Abstract
Large Language Models (LLMs) can serve as world models to enhance agent decision-making in digital environments by simulating future states and predicting action outcomes, potentially eliminating costly trial-and-error exploration. However, this capability is fundamentally limited by LLMs' tendency toward hallucination and their reliance on static training knowledge, which can lead to compounding errors that inhibit long-horizon simulations. To systematically investigate whether LLMs are appropriate for world modeling, we probe two core capabilities of world models--future state prediction and reward estimation--through three tasks: next-state identification, full-procedure planning alignment, and milestone transition recognition. Our analysis shows that while LLMs effectively capture immediate next states and identify meaningful state transitions, their performance rapidly degrades in full-procedure planning. This highlights LLMs' limitations in reliably modeling environment dynamics over long horizons. To address these limitations, we propose the Retrieval-augmented World Model (R-WoM), which grounds LLM simulations by incorporating factual, up-to-date knowledge retrieved from external tutorials. Experiments show that R-WoM achieves substantial improvements of up to 25.3% (OSWorld) and 18.1% (WebArena) compared to baselines, with particular advantages in longer-horizon simulations.
Problem

Research questions and friction points this paper is trying to address.

Addresses LLM limitations in long-horizon digital environment simulations
Improves world modeling by integrating retrieved external tutorial knowledge
Enhances agent decision-making through retrieval-augmented future state prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-augmented World Model enhances agent decision-making
Incorporates factual knowledge from external tutorials
Substantially improves long-horizon simulation performance
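The core idea, retrieving relevant tutorial text and folding it into the world-model prompt before predicting the next state, can be sketched as follows. This is a minimal illustration under assumptions: the toy corpus, the word-overlap retriever, and the prompt format are invented for this sketch and are not the paper's implementation.

```python
def score(query: str, doc: str) -> float:
    """Toy relevance score: fraction of query words found in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Return the top-k tutorial snippets for the query (stand-in for a real retriever)."""
    return sorted(corpus, key=lambda doc: score(query, doc), reverse=True)[:k]

def grounded_prompt(state: str, action: str, corpus: list[str]) -> str:
    """Build a world-model prompt grounded in retrieved tutorial knowledge.

    The returned string would be sent to an LLM asked to simulate the
    environment's next state (LLM call omitted here).
    """
    snippets = retrieve(f"{state} {action}", corpus)
    context = "\n".join(f"- {s}" for s in snippets)
    return (
        "Tutorial knowledge:\n" + context + "\n\n"
        f"Current state: {state}\n"
        f"Action: {action}\n"
        "Predict the next state."
    )

# Hypothetical two-snippet tutorial corpus for illustration.
corpus = [
    "To export a LibreOffice document as PDF, open the File menu and choose Export as PDF.",
    "In GIMP, use Filters > Blur to soften an image.",
]
prompt = grounded_prompt(
    "LibreOffice Writer with report.odt open",
    "click File menu",
    corpus,
)
```

The retriever here is deliberately trivial; the point is the structure: retrieved, up-to-date task knowledge conditions the simulation, so the LLM's next-state prediction is anchored to documented environment behavior rather than its parametric memory alone.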