NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

📅 2026-05-08

🤖 AI Summary

Existing world action models for joint video-action generation typically collapse the per-frame timesteps onto a single shared schedule, implicitly assuming that every predicted latent frame is equally reliable for action generation. This work instead treats the timestep of each latent frame as a learnable information-gating policy: changing a frame's noise level modulates how reliably its Key/Value contribution informs the action tokens during denoising. Built on a joint video-action Mixture-of-Transformers (MoT) backbone, the approach combines independent per-latent timestep sampling during backbone training, a lightweight Gating Policy Network that emits per-latent time increments, and task-reward optimization of the schedule policy. On diverse RoboTwin random-scene manipulation tasks, it delivers consistent gains.

📝 Abstract

World Action Models (WAMs) are an emerging family of policies that tie robot action generation to future-observation modeling. In this work, we focus on the joint video-action modeling paradigm, where actions and imagined future observations are co-generated along a shared denoising or flow trajectory, so that perception, prediction, and control are coupled within one generative process. Existing WAMs typically realize this paradigm with a Mixture-of-Transformers (MoT), where video and action tokens interact through shared self-attention. This architecture can in principle assign a separate timestep $t_f$ to each predicted latent frame, yet current systems collapse this degree of freedom onto a single shared scalar $t$. Under the noise-as-masking view of Diffusion Forcing, this shared schedule imposes the unjustified prior that every predicted latent is equally reliable for action generation. We instead view the per-latent schedule as a learnable information-gating policy: by changing a latent frame's noise level, the policy modulates the reliability of its Key/Value contribution to the action tokens. We propose NoiseGate, which combines independent per-latent timestep sampling during backbone training, a lightweight Gating Policy Network that emits per-latent time increments during denoising, and task-reward optimization that trains the schedule policy without hand-crafted shape priors. Built on a joint video-action MoT backbone, NoiseGate delivers consistent gains on diverse RoboTwin random-scene manipulation tasks.
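
To make the contrast between a shared scalar $t$ and per-latent timesteps $t_f$ concrete, the following is a minimal sketch and not the paper's implementation. It assumes a flow-style convention in which t = 0 is pure noise and t = 1 is a clean latent, which the abstract does not state. All function names are hypothetical, and `front_loaded_policy` is a hand-written placeholder for the learned Gating Policy Network, which the abstract says is trained from task reward rather than fixed by a rule like this one.

```python
import random


def sample_shared_timesteps(num_frames, rng):
    """Baseline: one scalar t is drawn and shared by every latent frame."""
    t = rng.random()
    return [t] * num_frames


def sample_per_latent_timesteps(num_frames, rng):
    """Independent sampling: each latent frame f gets its own t_f in [0, 1)."""
    return [rng.random() for _ in range(num_frames)]


def advance_schedule(timesteps, increments):
    """Add a per-latent time increment to each t_f, clamped to [0, 1]."""
    return [min(1.0, max(0.0, t + dt)) for t, dt in zip(timesteps, increments)]


def uniform_policy(num_steps):
    """Every frame advances by the same amount: recovers the shared schedule."""
    def policy(timesteps):
        return [1.0 / num_steps] * len(timesteps)
    return policy


def front_loaded_policy(num_steps):
    """Hypothetical placeholder rule (NOT learned): near frames advance faster.

    Frame 0 moves at twice the base rate, the last frame at the base rate.
    """
    def policy(timesteps):
        n = len(timesteps)
        return [(2.0 - f / max(1, n - 1)) / num_steps for f in range(n)]
    return policy


def run_schedule(num_frames, num_steps, policy):
    """Start all frames at t = 0 (assumed: pure noise) and apply the policy."""
    t = [0.0] * num_frames
    history = [t]
    for _ in range(num_steps):
        t = advance_schedule(t, policy(t))
        history.append(t)
    return history


if __name__ == "__main__":
    rng = random.Random(0)
    print("shared     :", sample_shared_timesteps(4, rng))
    print("per-latent :", sample_per_latent_timesteps(4, rng))
    for step, t in enumerate(run_schedule(3, 4, front_loaded_policy(4))):
        print("step", step, t)
```

Under the non-uniform policy, frames sit at different noise levels at intermediate steps, which is the degree of freedom the abstract describes as information gating. In the actual method the increments would come from a network conditioned on the model's state; here they depend only on the frame index.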

Problem

Research questions and friction points this paper is trying to address.

World Action Models

information gating

per-latent timestep

joint video-action modeling

Diffusion Forcing

Innovation

Methods, ideas, or system contributions that make the work stand out.

NoiseGate

World Action Models

per-latent timestep scheduling