RoboAlign-R1: Distilled Multimodal Reward Alignment for Robot Video World Models

📅 2026-05-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing robot video world models are trained with objectives that align poorly with decision-critical capabilities such as instruction following, manipulation success, and physical plausibility, and they accumulate errors in long-horizon autoregressive prediction. To address these limitations, this work proposes a reward-aligned post-training framework that distills a multimodal teacher judge into a lightweight student reward model, combined with a training-free Sliding Window Re-encoding (SWR) inference strategy. The approach improves the task consistency and long-horizon stability of generated videos. Evaluated on RobotWorldBench, the method improves the aggregate six-dimension score by 10.1% over the strongest baseline, with gains of 7.5% in Manipulation Accuracy and 4.6% in Instruction Following. SWR further improves long-horizon prediction quality, increasing SSIM by 2.8% and reducing LPIPS by 9.8%, at roughly 1% additional inference latency.
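To make the teacher-to-student distillation step concrete, here is a minimal sketch, not the paper's implementation. It assumes the teacher's six-dimensional scores are already available as numeric targets and replaces the student reward model with a toy linear regressor trained by plain SGD on mean squared error. The names `student_predict`, `distill`, and `mse`, the feature vectors, and the hypothetical teacher weights are all illustrative assumptions.

```python
from typing import List, Sequence

NUM_DIMS = 6  # the summary describes a six-dimensional score


def student_predict(weights: Sequence[Sequence[float]],
                    features: Sequence[float]) -> List[float]:
    """Toy linear student: one weight vector per score dimension (assumption)."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def mse(weights, feats, teacher_scores) -> float:
    """Mean squared error between student predictions and teacher scores."""
    total = 0.0
    for x, y in zip(feats, teacher_scores):
        pred = student_predict(weights, x)
        total += sum((p - t) ** 2 for p, t in zip(pred, y))
    return total / (len(feats) * NUM_DIMS)


def distill(feats, teacher_scores, lr: float = 0.1, epochs: int = 200):
    """Fit the student to the teacher's scores with per-sample SGD."""
    dim = len(feats[0])
    weights = [[0.0] * dim for _ in range(NUM_DIMS)]
    for _ in range(epochs):
        for x, y in zip(feats, teacher_scores):
            pred = student_predict(weights, x)
            for k in range(NUM_DIMS):
                err = pred[k] - y[k]
                for j in range(dim):
                    weights[k][j] -= lr * err * x[j]
    return weights


# Hypothetical data: 2-D video features and a made-up linear "teacher".
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
teacher_w = [[0.5, 0.1], [0.2, 0.3], [0.9, 0.0],
             [0.0, 0.7], [0.4, 0.4], [0.1, 0.8]]
teacher_scores = [student_predict(teacher_w, x) for x in feats]

initial_w = [[0.0, 0.0] for _ in range(NUM_DIMS)]
student_w = distill(feats, teacher_scores)
loss_before = mse(initial_w, feats, teacher_scores)
loss_after = mse(student_w, feats, teacher_scores)
```

In a real pipeline the student would presumably be a small video-conditioned network and its output would serve as the reward during reinforcement-learning post-training; the linear model here only illustrates regressing a cheap student onto teacher scores.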

📝 Abstract

Existing robot video world models are typically trained with low-level objectives such as reconstruction and perceptual similarity, which are poorly aligned with the capabilities that matter most for robot decision making, including instruction following, manipulation success, and physical plausibility. They also suffer from error accumulation in long-horizon autoregressive prediction. We present RoboAlign-R1, a framework that combines reward-aligned post-training with stabilized long-horizon inference for robot video world models. We construct RobotWorldBench, a benchmark of 10,000 annotated video-instruction pairs collected from four robot data sources, and train a multimodal teacher judge, RoboAlign-Judge, to provide fine-grained six-dimensional evaluation of generated videos. We then distill the teacher into a lightweight student reward model for efficient reinforcement-learning-based post-training. To reduce long-horizon rollout drift, we further introduce Sliding Window Re-encoding (SWR), a training-free inference strategy that periodically refreshes the generation context. Under our in-domain evaluation protocol, RoboAlign-R1 improves the aggregate six-dimension score by 10.1% over the strongest baseline, including gains of 7.5% on Manipulation Accuracy and 4.6% on Instruction Following; these ranking improvements are further supported by an external VLM-based cross-check and a blinded human study. Meanwhile, SWR improves long-horizon prediction quality with only about 1% additional latency, yielding a 2.8% gain in SSIM and a 9.8% reduction in LPIPS. Together, these results show that reward-aligned post-training and stabilized long-horizon decoding improve task consistency, physical realism, and long-horizon prediction quality in robot video world models.
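The abstract describes Sliding Window Re-encoding only as a training-free strategy that periodically refreshes the generation context. One plausible reading, sketched below under stated assumptions, is an autoregressive rollout that rebuilds its context every few steps from the most recent generated frames, so the context cannot keep growing or carrying stale state. The function `rollout_with_refresh`, its parameters `window` and `refresh_every`, and the toy `encode`/`step` callables are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple

Frame = float          # stand-in for an image; a real model would use arrays
Context = List[float]  # stand-in for the model's latent generation context


def rollout_with_refresh(
    seed_frames: Sequence[Frame],
    encode: Callable[[Sequence[Frame]], Context],
    step: Callable[[Context], Tuple[Frame, Context]],
    horizon: int,
    window: int,
    refresh_every: int,
) -> List[Frame]:
    """Generate `horizon` frames, re-encoding the context periodically."""
    frames = list(seed_frames)
    context = encode(frames[-window:])
    for t in range(1, horizon + 1):
        frame, context = step(context)
        frames.append(frame)
        if t % refresh_every == 0:
            # Assumed refresh rule: discard the accumulated context and
            # re-encode only the last `window` frames.
            context = encode(frames[-window:])
    return frames[len(seed_frames):]


# Toy model: frames are numbers, and the "context" is the list seen so far.
context_sizes: List[int] = []


def toy_encode(frames: Sequence[Frame]) -> Context:
    return list(frames)


def toy_step(context: Context) -> Tuple[Frame, Context]:
    context_sizes.append(len(context))  # record context size at each step
    nxt = context[-1] + 1.0
    return nxt, context + [nxt]


generated = rollout_with_refresh(
    seed_frames=[0.0, 1.0],
    encode=toy_encode,
    step=toy_step,
    horizon=6,
    window=2,
    refresh_every=3,
)
# generated == [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
```

In this toy run the context length never exceeds `window + refresh_every - 1`, which illustrates why a periodic refresh can bound drift without any retraining. The refresh interval and window size used by the paper are not stated in the abstract.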

Problem

Research questions and friction points this paper is trying to address.

robot video world models

reward alignment

long-horizon prediction

instruction following

physical plausibility

Innovation

Methods, ideas, or system contributions that make the work stand out.

reward alignment

multimodal distillation

robot video world models