🤖 AI Summary
This work addresses action-visual inconsistency in robot video generation, which arises when action vectors serve only as passive conditioning signals. To resolve this, we propose an action-aware diffusion inference framework that requires no additional training. Our method comprises two key components: (1) classifier-free guidance with action-scaled guidance weights, which dynamically modulates denoising strength; and (2) action-driven Gaussian latent initialization with noise truncation, which explicitly models the temporal influence of action trajectories on the generative process. Crucially, this is the first approach in which action parameters exert differentiable, active control over a diffusion model *during inference*, without architectural or training modifications. Experiments on real-world robot manipulation datasets demonstrate significant improvements in motion coherence and visual fidelity, and the framework applies broadly to trajectory-to-video synthesis across diverse robotic scenarios.
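
To make the first component concrete, the sketch below shows one way action-scaled classifier-free guidance could look at a single denoising step. The names (`eps_uncond`, `eps_cond`, `base_weight`, `alpha`) and the linear scaling rule are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def action_scaled_cfg(eps_uncond: torch.Tensor,
                      eps_cond: torch.Tensor,
                      action: torch.Tensor,
                      base_weight: float = 7.5,
                      alpha: float = 1.0) -> torch.Tensor:
    """Classifier-free guidance with a weight scaled by action magnitude.

    eps_uncond / eps_cond: the model's noise predictions without and with
    action conditioning at the current denoising step.
    """
    # Scale the guidance weight by the action norm: larger actions push
    # the sample harder toward the action-conditioned prediction, while
    # alpha = 0 recovers plain CFG with `base_weight`.
    w = base_weight * (1.0 + alpha * action.norm())
    return eps_uncond + w * (eps_cond - eps_uncond)
```

Because setting `alpha = 0` reduces this to standard classifier-free guidance, the effect of action scaling can be ablated with a single hyperparameter.
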
📝 Abstract
Generating realistic robot videos from explicit action trajectories is a critical step toward building effective world models and robotics foundation models. We introduce two training-free, inference-time techniques that fully exploit explicit action parameters in diffusion-based robot video generation. Instead of treating action vectors as passive conditioning signals, our methods use them actively to steer both classifier-free guidance and the initialization of the Gaussian latents. First, action-scaled classifier-free guidance dynamically modulates guidance strength in proportion to action magnitude, enhancing controllability over motion intensity. Second, action-scaled noise truncation adjusts the distribution of the initially sampled noise to better align with the desired motion dynamics. Experiments on real robot manipulation datasets demonstrate that these techniques significantly improve action coherence and visual quality across diverse robot environments.
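
The second technique can likewise be sketched as rejection sampling from a truncated Gaussian whose bound depends on action magnitude. The direction of the mapping (here, larger actions widen the admissible noise range to permit stronger motion) and all names are assumptions for illustration, not the paper's exact procedure.

```python
import torch

def action_scaled_init_noise(shape: tuple,
                             action: torch.Tensor,
                             base_bound: float = 2.0,
                             beta: float = 0.5) -> torch.Tensor:
    """Sample initial diffusion latents from a truncated Gaussian whose
    truncation bound grows with action magnitude."""
    # Illustrative mapping: small actions -> tighter noise (calmer video),
    # large actions -> wider noise (stronger motion dynamics).
    bound = base_bound * (1.0 + beta * action.norm())
    noise = torch.randn(shape)
    # Rejection step: resample values outside [-bound, bound] so the
    # result approximates a truncated normal distribution.
    mask = noise.abs() > bound
    while mask.any():
        noise[mask] = torch.randn(int(mask.sum()))
        mask = noise.abs() > bound
    return noise
```

Resampling, rather than clamping, keeps the marginal distribution Gaussian within the bound instead of piling probability mass at the truncation edges.
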