EmboAlign: Aligning Video Generation with Compositional Constraints for Zero-Shot Manipulation

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a key limitation of existing zero-shot robotic manipulation methods that map pixels to actions via video generation models: they often fail due to physically implausible generated behaviors and inaccuracies in depth and keypoint estimation. To overcome these issues, the authors propose a training-free, inference-time framework that integrates the structured spatial reasoning of vision-language models with video generation models. The method extracts task-relevant compositional constraints, uses them to guide rollout selection, and then optimizes the robot trajectory to satisfy both physical plausibility and semantic fidelity. Evaluated on six real-world manipulation tasks, the approach improves the overall success rate by 43.3 percentage points over the strongest baseline, substantially enhancing the physical realism and execution reliability of zero-shot robotic manipulation.
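
As a concrete illustration of the constraint-extraction step described above, the sketch below shows one plausible way to prompt a VLM for structured constraints. The prompt wording, JSON schema, and `query_vlm` helper are hypothetical assumptions for illustration, not taken from the paper.

```python
import json

# Hypothetical prompt template; the paper does not publish its exact prompt.
CONSTRAINT_PROMPT = """\
Task instruction: {instruction}
List the physical constraints that must hold for this manipulation to
succeed safely. Answer with a JSON list of objects, each with fields
"name", "subject", "relation", and "reference".
"""

def extract_constraints(instruction: str, query_vlm) -> list[dict]:
    """Ask a VLM for task-specific compositional constraints.

    `query_vlm` stands in for any callable that sends a text prompt (plus
    the current scene image, omitted here) to a vision-language model and
    returns its text reply.
    """
    reply = query_vlm(CONSTRAINT_PROMPT.format(instruction=instruction))
    return json.loads(reply)

# For "pour water from the kettle into the mug", the VLM might return
# entries such as {"name": "spout_over_mug", "subject": "kettle_spout",
# "relation": "above", "reference": "mug_opening"}.
```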

📝 Abstract
Video generative models (VGMs) pretrained on large-scale internet data can produce temporally coherent rollout videos that capture rich object dynamics, offering a compelling foundation for zero-shot robotic manipulation. However, VGMs often produce physically implausible rollouts, and converting their pixel-space motion into robot actions through geometric retargeting further introduces cumulative errors from imperfect depth estimation and keypoint tracking. To address these challenges, we present EmboAlign, a data-free framework that aligns VGM outputs with compositional constraints generated by vision-language models (VLMs) at inference time. The key insight is that VLMs offer a capability complementary to VGMs: structured spatial reasoning that can identify the physical constraints critical to the success and safety of manipulation execution. Given a language instruction, EmboAlign uses a VLM to automatically extract a set of compositional constraints capturing task-specific requirements, which are then applied at two stages: (1) constraint-guided rollout selection, which scores and filters a batch of VGM rollouts to retain the most physically plausible candidate, and (2) constraint-based trajectory optimization, which uses the selected rollout as initialization and refines the robot trajectory under the same constraint set to correct retargeting errors. We evaluate EmboAlign on six real-robot manipulation tasks requiring precise, constraint-sensitive execution, where it improves the overall success rate by 43.3 percentage points over the strongest baseline without any task-specific training data.
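
The two-stage pipeline in the abstract maps naturally onto a score-then-optimize loop. Below is a minimal sketch of that loop, assuming each compositional constraint can be expressed as a non-negative violation cost over a retargeted end-effector trajectory; the `Constraint` dataclass, the function names, the smoothness penalty, and the choice of L-BFGS-B are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

import numpy as np
from scipy.optimize import minimize


@dataclass
class Constraint:
    """One compositional constraint, reduced to a violation cost.

    `cost` maps a trajectory (a T x D array of end-effector poses) to a
    non-negative score; zero means the constraint is satisfied.
    """
    name: str
    cost: Callable[[np.ndarray], float]


def select_rollout(rollouts: Sequence[np.ndarray],
                   constraints: Sequence[Constraint]) -> np.ndarray:
    """Stage 1: score each retargeted VGM rollout against the constraint
    set and keep the most physically plausible candidate."""
    return min(rollouts,
               key=lambda traj: sum(c.cost(traj) for c in constraints))


def refine_trajectory(init_traj: np.ndarray,
                      constraints: Sequence[Constraint],
                      smooth_weight: float = 0.1) -> np.ndarray:
    """Stage 2: starting from the selected rollout, refine the trajectory
    so the same constraints are satisfied, correcting retargeting errors."""
    shape = init_traj.shape

    def objective(flat: np.ndarray) -> float:
        traj = flat.reshape(shape)
        violation = sum(c.cost(traj) for c in constraints)
        # A second-difference penalty keeps the refined path smooth.
        smoothness = float(np.sum(np.diff(traj, n=2, axis=0) ** 2))
        return violation + smooth_weight * smoothness

    result = minimize(objective, init_traj.ravel(), method="L-BFGS-B")
    return result.x.reshape(shape)
```

Stage 1 needs only a relative ordering of candidates, so any monotone aggregation of per-constraint costs would do; stage 2 reuses the same costs as an optimization objective so the refined trajectory corrects retargeting errors without drifting from the selected rollout's intent.
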
Problem

Research questions and friction points this paper is trying to address.

video generative models
physically implausible rollouts
geometric retargeting
cumulative errors
zero-shot manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

compositional constraints
zero-shot manipulation
vision-language models
video generative models
trajectory optimization