ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment

📅 2026-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the prevalent issue of physically implausible behaviors—such as object interpenetration and anti-gravity motion—in existing video world models for robotic manipulation tasks. To mitigate this, we propose a physics-aligned video generation paradigm featuring a 14B-parameter diffusion Transformer trained on 3 million physically annotated manipulation videos. Our approach integrates a novel DPO-based post-training framework, a decoupled discriminator, and a parallel context block for precise action conditioning. We further introduce EZSbench, the first training-agnostic, zero-shot embodied evaluation benchmark. The proposed method achieves state-of-the-art performance on both PBench and EZSbench, significantly outperforming Veo 3.1 and Sora v2 Pro in physical plausibility and trajectory consistency. EZSbench is publicly released to foster standardized evaluation in embodied video generation.

📝 Abstract
Video-based world models offer a powerful paradigm for embodied simulation and planning, yet state-of-the-art models often generate physically implausible manipulations, such as object penetration and anti-gravity motion, due to training on generic visual data and likelihood-based objectives that ignore physical laws. We present ABot-PhysWorld, a 14B-parameter Diffusion Transformer that generates visually realistic, physically plausible, and action-controllable videos. Built on a curated dataset of three million manipulation clips with physics-aware annotation, it uses a novel DPO-based post-training framework with decoupled discriminators to suppress unphysical behaviors while preserving visual quality. A parallel context block enables precise spatial action injection for cross-embodiment control. To better evaluate generalization, we introduce EZSbench, the first training-independent embodied zero-shot benchmark combining real and synthetic unseen robot-task-scene combinations; it employs a decoupled protocol to separately assess physical realism and action alignment. ABot-PhysWorld achieves new state-of-the-art performance on PBench and EZSbench, surpassing Veo 3.1 and Sora v2 Pro in physical plausibility and trajectory consistency. We will release EZSbench to promote standardized evaluation in embodied video generation.
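The abstract mentions a DPO-based post-training framework but gives no formula. As an illustration only, the generic pairwise DPO objective such frameworks typically build on can be sketched as below; the function name, the `beta` default, and the scalar log-likelihood interface are assumptions for this sketch, not the paper's formulation (diffusion-model variants of DPO usually substitute denoising-loss differences for exact log-likelihoods).

```python
import math

def dpo_pair_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Pairwise DPO loss for one (preferred, rejected) sample pair.

    logp_w / logp_l        : policy log-likelihoods of the physically
                             plausible (winner) and implausible (loser) video
    ref_logp_w / ref_logp_l: frozen reference-model log-likelihoods,
                             anchoring the policy so it cannot drift far
    beta                   : strength of the implicit KL penalty (assumed)
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), computed in a numerically stable way
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

With a zero margin the loss equals log 2; raising the winner's likelihood relative to the reference lowers the loss, which is the preference signal the post-training stage would exploit to suppress unphysical behaviors.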
Problem

Research questions and friction points this paper is trying to address.

physically implausible manipulation
video-based world models
embodied simulation
physics alignment
robotic manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Physics-aware video generation
Diffusion Transformer
DPO-based post-training
Embodied zero-shot benchmark
Action-controllable world model
👥 Authors
Yuzhi Chen
AMAP CV Lab, Alibaba Group
Ronghan Chen
AMAP CV Lab, Alibaba Group
Dongjie Huo
AMAP CV Lab, Alibaba Group
Yandan Yang
BIGAI (Beijing Institute for General Artificial Intelligence)
Computer Vision, Generation, Embodied AI
Dekang Qi
AMAP CV Lab, Alibaba Group
Haoyun Liu
AMAP CV Lab, Alibaba Group
Tong Lin
AMAP CV Lab, Alibaba Group
Shuang Zeng
Peking University, Georgia Institute of Technology
Self-supervised Contrastive Learning, Medical Image Segmentation, Superpixel, Large Language Model
Junjin Xiao
AMAP CV Lab, Alibaba Group
Xinyuan Chang
Xi'an Jiaotong University; Alibaba-Amap
Autonomous Driving, Computer Vision
Feng Xiong
Alibaba Inc.
Computer Vision
Xing Wei
AMAP CV Lab, Alibaba Group
Zhiheng Ma
AMAP CV Lab, Alibaba Group
Mu Xu
Alibaba
CV, LLM, VLM, VLA, RL