The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key challenges in controllable world-event video generation—including multi-agent interaction, object entry/exit, reference appearance preservation, and counterfactual event modeling—by proposing the first trajectory-text-image trimodal collaborative framework. Methodologically, it jointly models textual semantics, explicit trajectories (encoding motion, temporal ordering, and visibility), and reference images (serving as visual identity anchors), incorporating a trajectory encoder, reference-guided diffusion generation module, and spatiotemporal consistency regularization. Key contributions include: (i) strong cross-frame object identity consistency; (ii) spontaneous recovery of appearance and scene context after temporary occlusion or exit; and (iii) fine-grained user control over complex events (e.g., multi-agent collaboration, counterfactual actions) via natural language prompts. Experiments demonstrate significant improvements over existing unimodal and bimodal baselines in temporal coherence, emergent consistency, and reference appearance fidelity.
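The trimodal prompt described above can be made concrete with a small sketch. The class and field names below are illustrative assumptions, not the authors' actual API: a trajectory is a sequence of waypoints carrying motion, temporal ordering, and a visibility flag (which is what lets the model represent occlusion and object entry/exit), while text supplies semantic intent and a reference image anchors object identity.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Waypoint:
    frame: int      # temporal ordering
    x: float        # normalized image coordinates in [0, 1]
    y: float
    visible: bool   # False models occlusion or object exit

@dataclass
class EventPrompt:
    text: str                   # semantic intent for the event
    trajectory: List[Waypoint]  # explicit motion control
    reference_image: str        # path to the identity-anchoring image

def visibility_mask(prompt: EventPrompt) -> List[int]:
    """Per-waypoint visibility flags: the signal that would let a model
    handle object entry/exit and recover identity after occlusion."""
    return [int(w.visible) for w in prompt.trajectory]

# Example: an object enters, is briefly occluded, then reappears.
prompt = EventPrompt(
    text="a red ball rolls behind the pillar and re-emerges",
    trajectory=[
        Waypoint(0, 0.10, 0.5, True),
        Waypoint(8, 0.40, 0.5, True),
        Waypoint(16, 0.60, 0.5, False),  # occluded behind the pillar
        Waypoint(24, 0.90, 0.5, True),   # re-entry: identity must persist
    ],
    reference_image="ball_ref.png",
)
print(visibility_mask(prompt))  # [1, 1, 0, 1]
```

In the paper's framework these three signals are consumed jointly (a trajectory encoder plus reference-guided diffusion); the sketch only shows the shape of the user-facing prompt, with visibility made explicit per waypoint.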

📝 Abstract
We present WorldCanvas, a framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference images. Unlike text-only approaches and existing trajectory-controlled image-to-video methods, our multimodal approach combines trajectories -- encoding motion, timing, and visibility -- with natural language for semantic intent and reference images for visual grounding of object identity, enabling the generation of coherent, controllable events that include multi-agent interactions, object entry/exit, reference-guided appearance, and counterintuitive events. The resulting videos demonstrate not only temporal coherence but also emergent consistency, preserving object identity and scene context despite temporary disappearance. By supporting expressive world-event generation, WorldCanvas advances world models from passive predictors to interactive, user-shaped simulators. Our project page is available at: https://worldcanvas.github.io/.
Problem

Research questions and friction points this paper is trying to address.

How to generate coherent, controllable world events from multimodal prompts
How to combine trajectories, text, and reference images for user-directed simulation
How to advance world models from passive predictors into interactive, user-shaped simulators
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal approach combines trajectories, text, and reference images
Generates coherent, controllable events with multi-agent interactions
Advances world models from passive predictors to interactive simulators