🤖 AI Summary
Existing world models struggle to maintain long-term scene consistency and to support user-specified camera control, often producing inconsistent content and poor controllability when scenes are revisited. To address these limitations, this work proposes a time-aware positional encoding warping mechanism integrated into a dual-stream diffusion Transformer architecture, which jointly models long-term memory and enables precise camera control without requiring explicit 3D reconstruction. Training relies on a large-scale monocular video dataset in which scene revisiting is simulated via point-cloud-based rendering, supporting efficient, high-fidelity, and controllable video generation. Experimental results demonstrate that the proposed method significantly outperforms state-of-the-art approaches on both real-world and synthetic benchmarks, achieving leading performance in long-term consistency and camera controllability.
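The summary does not spell out how the warping mechanism operates, so the following is only a minimal illustrative sketch of one way a time-aware positional-encoding warp could be realized, assuming memory-frame tokens carry 3D positions that are reprojected into the current camera before rotary phases are computed. All function names, signatures, and the simplified one-axis phase mixing below are hypothetical and not taken from UCM.

```python
import torch

def warp_token_positions(points_mem, K, T_mem_to_cur):
    """Reproject memory-frame token positions (3D, in the memory frame's
    camera coordinates) into the current camera, so revisited content can
    reuse the positional encoding it had when first observed.

    points_mem   : (N, 3) 3D positions of memory tokens
    K            : (3, 3) camera intrinsics
    T_mem_to_cur : (4, 4) relative pose, memory frame -> current frame
    returns      : (N, 2) warped 2D coordinates in the current view
    """
    n = points_mem.shape[0]
    homo = torch.cat([points_mem, torch.ones(n, 1)], dim=1)   # (N, 4) homogeneous points
    cam = (T_mem_to_cur @ homo.T).T[:, :3]                    # points in current camera frame
    uv = (K @ cam.T).T                                        # pinhole projection
    return uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)             # perspective divide

def time_aware_rope_phase(uv, t, dim=64, base=10000.0):
    """Turn warped (u, v) coordinates plus a time index t into rotary-embedding
    phase angles, placing current and memory tokens in one shared positional
    space. Space and time are collapsed to a single axis here purely for brevity."""
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)  # (half,) frequency bands
    coord = uv.sum(dim=1, keepdim=True) + t                            # (N, 1) toy spatio-temporal coordinate
    return coord * freqs                                               # (N, half) phases for sin/cos rotation
```

In a full model, such phases would rotate the query/key channels of the memory stream before cross-attention, so tokens from a revisited viewpoint align positionally with the tokens generated when the scene was first seen.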
📝 Abstract
World models based on video generation demonstrate remarkable potential for simulating interactive environments but face persistent difficulties in two key areas: maintaining long-term content consistency when scenes are revisited and enabling precise camera control from user-provided inputs. Existing methods based on explicit 3D reconstruction often sacrifice flexibility in unbounded scenes and struggle with fine-grained structures, while alternative methods condition directly on previously generated frames without establishing explicit spatial correspondence, thereby limiting controllability and consistency. To address these limitations, we present UCM, a novel framework that unifies long-term memory and precise camera control via a time-aware positional encoding warping mechanism. To reduce computational overhead, we design an efficient dual-stream diffusion Transformer for high-fidelity generation. Moreover, we introduce a scalable data curation strategy that uses point-cloud-based rendering to simulate scene revisiting, enabling training on over 500K monocular videos. Extensive experiments on real-world and synthetic benchmarks demonstrate that UCM significantly outperforms state-of-the-art methods in long-term scene consistency, while also achieving precise camera controllability in high-fidelity video generation.
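On the data side, the abstract mentions point-cloud-based rendering to simulate scene revisiting but gives no pipeline details. The sketch below shows one plausible version under the assumption that per-frame depth is available: a frame is unprojected into a colored point cloud and splatted into a virtual "return" camera. The helper names (unproject_to_pointcloud, render_revisit_view) and the naive z-buffer loop are illustrative only, not UCM's actual pipeline.

```python
import torch

def unproject_to_pointcloud(depth, K):
    """Lift a depth map (H, W) into a 3D point cloud using intrinsics K."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3)  # (HW, 3) pixels
    rays = (torch.linalg.inv(K) @ pix.T).T                                # back-projected rays
    return rays * depth.reshape(-1, 1)                                    # scale rays by depth

def render_revisit_view(points, colors, K, T_world_to_cam, H, W):
    """Splat a colored point cloud into a virtual revisit camera with z-buffering."""
    homo = torch.cat([points, torch.ones(points.shape[0], 1)], dim=1)
    cam = (T_world_to_cam @ homo.T).T[:, :3]          # points in the revisit camera frame
    uv = (K @ cam.T).T
    z = uv[:, 2].clamp(min=1e-6)
    u = (uv[:, 0] / z).round().long()                 # integer pixel coordinates
    v = (uv[:, 1] / z).round().long()
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (cam[:, 2] > 0)
    img = torch.zeros(H, W, 3)
    zbuf = torch.full((H, W), float("inf"))
    for i in torch.nonzero(valid).flatten():          # naive z-buffer; real pipelines vectorize this
        if z[i] < zbuf[v[i], u[i]]:
            zbuf[v[i], u[i]] = z[i]
            img[v[i], u[i]] = colors[i]
    return img
```

Pairs of original frames and such re-rendered revisit views would then supervise the model to keep previously observed content consistent, which matches the curation goal the abstract describes.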