Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement

📅 2026-05-07
📈 Citations: 0
Influential: 0

🤖 AI Summary
Traditional world models rely on pixel-level encoding and heavy decoders, resulting in high computational costs and poor interpretability. This work proposes NOVA, a framework that directly models system states as the weights and biases of coordinate-based implicit neural representations (INRs) and generates images through analytical rendering, thereby circumventing decoder bottlenecks. Without requiring auxiliary losses or adversarial training, NOVA naturally disentangles structural components such as background, foreground, and motion, enabling independent editing of content and dynamics as well as zero-shot super-resolution. Evaluated on multiple challenging datasets, NOVA achieves efficient and highly controllable video prediction using only a single consumer-grade GPU and approximately 40 million parameters.
📝 Abstract
Training world models on vast quantities of unlabelled videos is a critical step toward fully autonomous intelligence. However, the prevailing paradigm of encoding raw pixels into opaque latent spaces and relying on heavy decoders for reconstruction leaves these models computationally expensive and uninterpretable. We address this problem by introducing NOVA, a world modelling framework that represents the system state as the weights and biases of an auxiliary coordinate-based implicit neural representation (INR). This structured representation is analytically rendered, which eliminates the decoder bottleneck while conferring compactness, portability, and zero-shot super-resolution. Furthermore, like most latent action models, NOVA can be distilled into a context-dependent video generator via an action-matching objective. Surprisingly, without resorting to auxiliary losses or adversarial objectives, NOVA can disentangle structural scene components such as background, foreground, and inter-frame motion, enabling users to edit either content or dynamics without compromising the other. We validate our framework on several challenging datasets, achieving strong controllable forecasting while operating on a single consumer GPU at $\sim$40M parameters. Ultimately, structured representations like INRs not only enhance our understanding of latent dynamics but also pave the way for immersive and customisable virtual experiences.
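To make the central idea concrete, here is a minimal sketch of what it means to treat a flat vector of INR weights and biases as the "state" and to render it by evaluating the network on a pixel-coordinate grid. It is an illustration, not NOVA's actual architecture: the layer sizes, the sine activations, the sigmoid output, and the names `LAYER_SIZES`, `num_params`, `unpack`, and `render` are all assumptions introduced here.

```python
import numpy as np

# Hypothetical layer sizes: 2-D pixel coordinate in, RGB colour out.
LAYER_SIZES = [2, 32, 32, 3]

def num_params(sizes=LAYER_SIZES):
    """Length of the flat state vector (all weights plus all biases)."""
    return sum(i * o + o for i, o in zip(sizes[:-1], sizes[1:]))

def unpack(theta, sizes=LAYER_SIZES):
    """Split a flat state vector into per-layer (weight, bias) pairs."""
    layers, k = [], 0
    for i, o in zip(sizes[:-1], sizes[1:]):
        W = theta[k:k + i * o].reshape(i, o)
        k += i * o
        b = theta[k:k + o]
        k += o
        layers.append((W, b))
    return layers

def render(theta, height, width, sizes=LAYER_SIZES):
    """Evaluate the INR on a height x width grid over [-1, 1]^2."""
    ys = np.linspace(-1.0, 1.0, height)
    xs = np.linspace(-1.0, 1.0, width)
    grid = np.stack(np.meshgrid(xs, ys, indexing="xy"), axis=-1)  # (H, W, 2)
    h = grid.reshape(-1, 2)
    layers = unpack(theta, sizes)
    for W, b in layers[:-1]:
        h = np.sin(h @ W + b)  # sine activation is an assumption (SIREN-style)
    W, b = layers[-1]
    rgb = 1.0 / (1.0 + np.exp(-(h @ W + b)))  # squash colours into (0, 1)
    return rgb.reshape(height, width, sizes[-1])

# A random vector stands in for a state that a world model would predict.
rng = np.random.default_rng(0)
theta = rng.normal(scale=0.5, size=num_params())

low = render(theta, 16, 16)    # native resolution
high = render(theta, 64, 64)   # same state, denser coordinate grid
```

Because the image is a function of continuous coordinates, the same state vector can be sampled on a denser grid without retraining, which is the sense in which a representation like this supports zero-shot super-resolution. No learned decoder appears anywhere in the rendering path; the only parameters involved are those in `theta`.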
Problem

Research questions and friction points this paper is trying to address.

world models
latent space
decoder bottleneck
computational expense
interpretability
Innovation

Methods, ideas, or system contributions that make the work stand out.

implicit neural representation
world models
latent disentanglement
decoder-free rendering
controllable video generation