Learning Visual Feature-Based World Models via Residual Latent Action

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing feature-based visual world models rely on direct regression, which often yields blurry or collapsed predictions in complex interactions, while generative modeling in high-dimensional feature spaces remains challenging. This work proposes a Residual Latent Action (RLA) representation that captures temporal dynamics by learning from residuals of DINO visual features, and introduces an RLA-based world model (RLA-WM) that uses flow matching to predict RLA values, operating in visual feature space rather than on raw pixels. Notably, this approach enables, for the first time, fully offline reinforcement learning within a world model trained exclusively on action-free videos, using a video-aligned reward. Experiments show that RLA-WM outperforms state-of-the-art feature-based and video-diffusion world models on both simulated and real-world datasets, runs orders of magnitude faster than video diffusion, and supports efficient policy learning.
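To make "residuals of DINO visual features" concrete, here is a minimal sketch of the raw frame-to-frame residual from which a latent action could be learned. It is an illustration under stated assumptions, not the paper's implementation: the function name `feature_residuals` is hypothetical, each frame is reduced to a flat list of floats standing in for DINO features, and the learned encoder that would map residuals to RLA is not shown.

```python
from typing import List

Vector = List[float]


def feature_residuals(frames: List[Vector]) -> List[Vector]:
    """Return the difference between each pair of consecutive feature vectors.

    `frames` is a hypothetical stand-in for per-frame visual features
    (for example, flattened DINO features); residual i is frames[i + 1] - frames[i].
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames to form a residual")
    return [
        [nxt_val - prev_val for prev_val, nxt_val in zip(prev, nxt)]
        for prev, nxt in zip(frames, frames[1:])
    ]


# Toy features for three frames with two feature dimensions each.
toy_frames = [[1.0, 2.0], [1.5, 1.0], [3.0, 1.0]]
toy_residuals = feature_residuals(toy_frames)
```

With T frames the sketch yields T - 1 residuals, and adding a residual back to its preceding frame recovers the next frame's features, which is why predicting residuals is sufficient to roll features forward.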

📝 Abstract

World models predict future transitions from observations and actions. Existing work focuses predominantly on image generation. Visual feature-based world models, on the other hand, predict future visual features instead of raw video pixels, offering a promising alternative that is more efficient and less prone to hallucination. However, current feature-based approaches rely on direct regression, which leads to blurry or collapsed predictions in complex interactions, while generative modeling in high-dimensional feature spaces remains challenging. In this work, we discover that a new type of latent action representation, which we refer to as *Residual Latent Action* (RLA), can be easily learned from DINO residuals. We also show that RLA is predictive, generalizable, and encodes temporal progression. Building on RLA, we propose *RLA World Model* (RLA-WM), which predicts RLA values via flow matching. RLA-WM outperforms both state-of-the-art feature-based and video-diffusion world models on simulation and real-world datasets, while being orders of magnitude faster than video diffusion. Furthermore, we develop two robot learning techniques that use RLA-WM to improve policy learning. The first is a minimalist world action model with RLA that learns from actionless demonstration videos. The second is the first visual RL framework trained entirely inside a world model learned from offline videos only, using a video-aligned reward and no online interactions or handcrafted rewards. Project page: https://mlzxy.github.io/rla-wm
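The abstract names flow matching but gives no formulation, so the following is a generic sketch of the common linear-path variant, not RLA-WM's actual objective or sampler. All names (`flow_matching_pair`, `fm_loss`, `euler_sample`) are hypothetical, vectors are plain lists of floats, and the velocity network is replaced by an arbitrary callable.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def flow_matching_pair(x0: Vector, x1: Vector, t: float) -> Tuple[Vector, Vector]:
    """Linear-path flow matching: return the interpolant x_t and its target velocity.

    x0 is a noise sample, x1 is the data sample (here, a stand-in for an RLA value).
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]  # d x_t / d t along the straight path
    return x_t, velocity


def fm_loss(predicted: Vector, target: Vector) -> float:
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def euler_sample(
    velocity_fn: Callable[[Vector, float], Vector], x0: Vector, steps: int = 10
) -> Vector:
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In training, a model would be fit so that its output at `(x_t, t)` minimizes `fm_loss` against the target velocity; at inference, `euler_sample` turns a noise sample into a prediction in a handful of steps, which is one plausible reason a feature-space flow model can be much cheaper than pixel-space video diffusion.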

Problem

Research questions and friction points this paper is trying to address.

world models

visual features

latent action

future prediction

generative modeling

Innovation

Methods, ideas, or system contributions that make the work stand out.

Residual Latent Action

Feature-based World Model

Flow Matching