🤖 AI Summary
This work addresses the rapid degradation of visual fidelity in autoregressive multi-step rollouts of action-conditioned robotic world models, a problem driven primarily by error accumulation. To mitigate it, the authors propose a post-training framework grounded in contrastive reinforcement learning that optimizes the model on its own generated rollout sequences. The approach introduces a multi-candidate, variable-length future-comparison mechanism and a visual fidelity reward that integrates multi-view perceptual metrics, substantially improving both consistency and realism in long-horizon predictions. Evaluated on the DROID dataset, the method achieves state-of-the-art rollout fidelity, reducing LPIPS by 14% (external cameras) and improving SSIM by 9.1% (wrist camera), with human evaluators preferring its outputs 80% of the time in blind tests.
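The failure mode the summary describes, errors compounding as each predicted clip is fed back as context for the next, can be illustrated with a minimal sketch. The `predict` callable and its signature are hypothetical stand-ins for the world model, not the paper's interface:

```python
import numpy as np

def autoregressive_rollout(predict, context, action_chunks):
    """Roll a world model forward autoregressively.

    Each predicted clip is appended to the context used for the next
    prediction, so any per-step error compounds over the rollout.
    `predict(context_frames, actions) -> predicted_frames` is an
    illustrative signature, not the paper's actual API.
    """
    frames = [context]
    for actions in action_chunks:
        clip = predict(np.concatenate(frames, axis=0), actions)
        frames.append(clip)  # predicted frames feed back as context
    return np.concatenate(frames[1:], axis=0)

def toy_predict(ctx, actions):
    # Toy "model": repeats the last context frame with a small drift,
    # mimicking a per-step prediction error that accumulates.
    return np.stack([ctx[-1] + 0.1 * (i + 1) for i in range(len(actions))])
```

Running the toy rollout shows the drift growing with each step, which is exactly the degradation the post-training scheme targets by training on these rollouts instead of ground-truth histories.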
📝 Abstract
Action-conditioned robot world models generate future video frames of the manipulated scene given a robot action sequence, offering a promising alternative for simulating tasks that are difficult to model with traditional physics engines. However, these models are optimized for short-term prediction and break down when deployed autoregressively: each predicted clip feeds back as context for the next, causing errors to compound and visual quality to degrade rapidly. We address this through the following contributions. First, we introduce a reinforcement learning (RL) post-training scheme that trains the world model on its own autoregressive rollouts rather than on ground-truth histories. We achieve this by adapting a recent contrastive RL objective for diffusion models to our setting and show that its convergence guarantees carry over exactly. Second, we design a training protocol that generates and compares multiple candidate variable-length futures from the same rollout state, reinforcing higher-fidelity predictions over lower-fidelity ones. Third, we develop efficient, multi-view visual fidelity rewards that combine complementary perceptual metrics across camera views and are aggregated at the clip level for a dense, low-variance training signal. Fourth, we show that our approach establishes a new state-of-the-art for rollout fidelity on the DROID dataset, outperforming the strongest baseline on all metrics (e.g., LPIPS reduced by 14% on external cameras, SSIM improved by 9.1% on the wrist camera), winning 98% of paired comparisons, and achieving an 80% preference rate in a blind human study.
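The second and third contributions can be sketched together: combine per-frame, per-view perceptual metrics into one clip-level reward, then derive pairwise preferences among candidate futures from the same rollout state. The metric names, weights, and the simple mean aggregation below are illustrative assumptions; the paper's exact reward design is not reproduced here:

```python
import numpy as np

def clip_reward(lpips, ssim, w_lpips=0.5, w_ssim=0.5):
    """Aggregate perceptual metrics into a clip-level scalar reward.

    `lpips` and `ssim` are arrays of shape (num_frames, num_views),
    e.g. one column per camera view. LPIPS is a distance (lower is
    better), SSIM a similarity (higher is better). Weights and the
    mean aggregation are illustrative, not the paper's exact design.
    """
    lpips_term = 1.0 - lpips.mean()  # convert distance to a similarity
    ssim_term = ssim.mean()
    return w_lpips * lpips_term + w_ssim * ssim_term

def pairwise_preferences(rewards):
    """Given clip-level rewards for candidate futures generated from the
    same rollout state, return (winner, loser) index pairs: every
    strictly higher-reward candidate is preferred over every lower one.
    Such pairs are the raw material for a contrastive RL objective that
    reinforces higher-fidelity predictions over lower-fidelity ones."""
    return [(i, j)
            for i in range(len(rewards))
            for j in range(len(rewards))
            if rewards[i] > rewards[j]]
```

Averaging over frames and views before comparing candidates is one way to obtain the dense, low-variance clip-level signal the abstract mentions, since single-frame metric values are noisy.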