Self-Improving World Modelling with Latent Actions

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of enabling large language models and vision-language models to learn interpretable, planning-capable world dynamics from trajectory data that lacks explicit action labels. The authors propose SWIRL, a framework that treats actions as latent variables and achieves unsupervised, self-improving world modelling by alternating optimization of a forward world model and an inverse dynamics model. Combining variational information maximization with evidence lower bound (ELBO) optimization in a coordinate ascent scheme, SWIRL uses the log-likelihood from one model as the reward signal for the other within a GRPO-based reinforcement learning loop. The approach comes with theoretical guarantees for model identifiability and consistency. Empirically, SWIRL yields consistent gains of 16% on AURORABench, 28% on ByteMorph, 16% on WorldPredictionBench, and 14% on StableToolBench.
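The GRPO-based loop described above rewards each sampled candidate with the frozen opposite model's log-probability and normalises rewards within a sampling group. A schematic sketch of that group-relative advantage computation, using synthetic stand-in numbers (the array `frozen_fwm_logprob` and group size `G` are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# For one observed transition (x -> y), the policy samples a group of G
# candidates; each candidate's reward is the frozen opposite model's
# log-probability (here random stand-ins), and GRPO normalises rewards
# within the group to form advantages.
G = 8
frozen_fwm_logprob = rng.normal(loc=-2.0, scale=0.7, size=G)

rewards = frozen_fwm_logprob
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```

The group-relative normalisation means no separate value network is needed: a candidate is reinforced only insofar as the frozen model scores it above its group's average.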

📝 Abstract
Internal modelling of the world -- predicting transitions between previous states $X$ and next states $Y$ under actions $Z$ -- is essential to reasoning and planning for LLMs and VLMs. Learning such models typically requires costly action-labelled trajectories. We propose SWIRL, a self-improvement framework that learns from state-only sequences by treating actions as a latent variable and alternating between Forward World Modelling (FWM) $P_\theta(Y|X,Z)$ and Inverse Dynamics Modelling (IDM) $Q_\phi(Z|X,Y)$. SWIRL iterates two phases: (1) Variational Information Maximisation, which updates the FWM to generate next states that maximise conditional mutual information with latent actions given prior states, encouraging identifiable consistency; and (2) ELBO Maximisation, which updates the IDM to explain observed transitions, effectively performing coordinate ascent. Both models are trained with reinforcement learning (specifically, GRPO), with the opposite frozen model's log-probability as the reward signal. We provide theoretical learnability guarantees for both updates, and evaluate SWIRL on LLMs and VLMs across multiple environments: single-turn and multi-turn open-world visual dynamics, and synthetic textual environments for physics, web, and tool calling. SWIRL achieves gains of 16% on AURORABench, 28% on ByteMorph, 16% on WorldPredictionBench, and 14% on StableToolBench.
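The alternating FWM/IDM scheme in the abstract can be sketched on a tiny tabular problem. This is a minimal analogue, not the paper's method: the GRPO-based RL updates are replaced by exact tabular coordinate ascent (an EM-style update), and the environment, sample counts, and smoothing constant are all toy assumptions. The latent action `z` is sampled to generate data but never shown to the learner, matching the state-only setting:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 2, 2  # number of states and latent actions

# Ground-truth dynamics P(y|x,z), used only to sample state-only pairs (x, y).
true_fwm = np.array([[[0.9, 0.1], [0.1, 0.9]],
                     [[0.2, 0.8], [0.8, 0.2]]])
xs = rng.integers(0, S, 500)
zs = rng.integers(0, A, 500)  # hidden from the learner
ys = np.array([rng.choice(S, p=true_fwm[x, z]) for x, z in zip(xs, zs)])

def avg_loglik(fwm):
    """Average marginal log-likelihood of (x, y) with a uniform prior on z."""
    return float(np.mean(np.log(fwm[xs, :, ys].mean(axis=1) + 1e-12)))

# Random initial forward world model P_theta(y|x,z).
fwm = rng.random((S, A, S))
fwm /= fwm.sum(axis=-1, keepdims=True)
ll_before = avg_loglik(fwm)

for _ in range(25):
    # IDM step (ELBO maximisation): Q_phi(z|x,y) set to the posterior under
    # the frozen FWM -- its log-probability plays the reward role here.
    q = fwm[xs, :, ys]              # shape (N, A)
    q /= q.sum(axis=1, keepdims=True)
    # FWM step (coordinate ascent): refit P(y|x,z) against the frozen IDM's
    # responsibilities over latent actions (with light smoothing).
    counts = np.full((S, A, S), 1e-6)
    for i in range(len(xs)):
        counts[xs[i], :, ys[i]] += q[i]
    fwm = counts / counts.sum(axis=-1, keepdims=True)

ll_after = avg_loglik(fwm)
```

Each round improves the marginal likelihood of the observed transitions, which is the coordinate-ascent property the abstract appeals to; the paper's contribution is making this loop work with generative LLM/VLM policies via GRPO rather than tabular updates.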
Problem

Research questions and friction points this paper is trying to address.

world modelling
latent actions
state-only sequences
action-labelled trajectories
internal modelling
Innovation

Methods, ideas, or system contributions that make the work stand out.

latent actions
self-improving world modelling
forward-inverse dynamics
variational information maximisation
GRPO reinforcement learning