WorldCraft: From Camera Navigation to Object Manipulation in Interactive Video World Models

📅 2026-05-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video world models support camera navigation but not object-centric interaction. To address this, the authors propose WorldCraft, a trajectory-centric control framework that decouples camera navigation from object manipulation and coordinates the two within a single video world model. The user clicks a target object and sketches its desired motion path, and the model generates future frames in which the object follows that path while the camera keeps moving. The approach has three components: Normalized World Trajectory (NWT), Spatial-Pathway LoRA (SP-LoRA), and Trajectory-Anchored State Persistence (TASP). Together they aim to preserve the pretrained camera control, enable accurate object manipulation, and keep the state of off-screen objects consistent over long-horizon autoregressive generation.
📝 Abstract
Recent video-based world models have made pixel-space environments interactive at the camera level: users can navigate viewpoints while the model generates coherent visual continuations. Yet their action spaces remain incomplete: users can move the camera, but cannot act on individual objects. Since real-world interaction is inherently object-centric, such models remain closer to passive scene observers than truly manipulable environments. We present WorldCraft, a framework that expands interactive video world models from camera navigation to object-level trajectory actions. Given a user click and a sketched path, WorldCraft generates future frames in which the selected object follows the prescribed trajectory while the camera continues to navigate the scene. WorldCraft achieves this through a trajectory-centric control pipeline: First, Normalized World Trajectory (NWT) represents user-drawn motion in a camera-invariant world coordinate system and dynamically re-projects it under the current camera pose, separating object motion from camera-induced screen-space displacement; Spatial-Pathway LoRA (SP-LoRA) then injects this world-space signal through the model's spatial-control pathway, adding object manipulation capability while preserving the pretrained camera controller; finally, Trajectory-Anchored State Persistence (TASP) treats the world trajectory as a persistent spatial state and refreshes autoregressive memory after trajectory-conditioned generation, allowing moved objects to reappear at their updated positions after leaving the camera view. Experiments show that WorldCraft enables accurate object control, preserves the video-based world model's camera fidelity under camera-only evaluation, and maintains object state across long autoregressive rollouts with off-camera excursions.
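The re-projection idea behind NWT can be illustrated with a small sketch. This is not the authors' implementation: it assumes a standard pinhole camera with known intrinsics and a known per-frame pose, and every name in it (`yaw_rotation`, `project`, `reproject_trajectory`) is hypothetical. It only shows why a trajectory stored in world coordinates has to be re-projected each frame: the object below never moves in the world, yet its screen position shifts as the camera translates.

```python
import math

def yaw_rotation(theta):
    """World-to-camera rotation for a camera yawed by theta radians about the y axis."""
    c, s = math.cos(theta), math.sin(theta)
    # Transpose (inverse) of the camera-to-world yaw rotation.
    return [[c, 0.0, -s], [0.0, 1.0, 0.0], [s, 0.0, c]]

def project(point, rotation, center, fx, fy, cx, cy):
    """Project one world-space point to pixels; None if it lies behind the camera."""
    # Express the point relative to the camera center, then rotate into the camera frame.
    d = [point[i] - center[i] for i in range(3)]
    x, y, z = (sum(rotation[r][i] * d[i] for i in range(3)) for r in range(3))
    if z <= 1e-9:
        return None  # not visible in this frame (assumed convention)
    return (fx * x / z + cx, fy * y / z + cy)

def reproject_trajectory(world_traj, poses, intrinsics):
    """Re-project a world-space trajectory under the camera pose of each frame."""
    return [project(p, R, c, *intrinsics) for p, (R, c) in zip(world_traj, poses)]

# Assumed intrinsics: focal lengths and principal point, in pixels.
intrinsics = (500.0, 500.0, 320.0, 240.0)
# The object stays at the same world position for three frames.
world_traj = [(0.0, 0.0, 4.0)] * 3
# The camera slides to the right by one unit per frame without rotating.
poses = [(yaw_rotation(0.0), (float(k), 0.0, 0.0)) for k in range(3)]

screen_path = reproject_trajectory(world_traj, poses, intrinsics)
print(screen_path)  # [(320.0, 240.0), (195.0, 240.0), (70.0, 240.0)]
```

The screen-space displacement here comes entirely from camera motion, which is the component a world-space representation is meant to factor out. Returning `None` for points behind the camera is a simplification of the off-screen case; how WorldCraft handles objects outside the view is described only at the level of the abstract.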
Problem

Research questions and friction points this paper is trying to address.

interactive video world models
object manipulation
camera navigation
trajectory control
world models
Innovation

Methods, ideas, or system contributions that make the work stand out.

interactive world models
object manipulation
trajectory control
camera-invariant representation
autoregressive video generation
Bohai Gu
The Hong Kong University of Science and Technology
Taiyi Wu
AI Technology Center, Tencent Video, Tencent
Yueyang Yuan
Wuhan University
Jian Liu
The Hong Kong University of Science and Technology
Xiaocheng Lu
The Hong Kong University of Science and Technology
Dazhao Du
Hong Kong University of Science and Technology
MultiModal LLM, Video Understanding, Time Series Forecasting, Deep Learning
Jie Zhang
The Hong Kong University of Science and Technology
Jinxiang Lai
Hong Kong University of Science and Technology (HKUST)
Multimodal LLM, Few-Shot Learning, Computer Vision
Shuai Yang
Peking University
Xiaotong Zhao
AI Technology Center, Tencent Video, Tencent
Alan Zhao
AI Technology Center, Tencent Video, Tencent
Song Guo
Chair Professor of CSE, HKUST
Large Language Model, Edge AI, Machine Learning Systems