Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory

📅 2026-02-02

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Existing world models struggle to maintain long-term visual consistency on real-world videos, primarily because pose estimates are noisy and viewpoint revisits are sparse. To address this, the work proposes a Hierarchical Pose-free Memory Compressor (HPMC), an uncertainty-aware action labeling scheme that discretizes continuous motion into a tri-state logic, and a revisit-dense fine-tuning strategy. The framework enables efficient long-horizon modeling without relying on geometric priors or accurate pose information. The approach substantially improves the visual fidelity, action controllability, and spatial coherence of generated videos, achieving, for the first time in real-world settings, coherent interactive generation spanning over 1,000 frames.
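One way to picture the tri-state action discretization is a pair of thresholds applied to a per-frame motion estimate. The sketch below is illustrative only: the state names (`IDLE`, `ACTIVE`, `UNCERTAIN`), the thresholds `low` and `high`, and the function names are assumptions made for this example, not the paper's actual labeling rule.

```python
from enum import Enum
from typing import Iterable, List


class ActionState(Enum):
    """Hypothetical tri-state label for one action channel."""
    IDLE = 0        # motion is confidently absent
    ACTIVE = 1      # motion is confidently present
    UNCERTAIN = 2   # estimate is too ambiguous to trust


def label_motion(magnitude: float, low: float = 0.1, high: float = 0.5) -> ActionState:
    """Map a continuous motion estimate to one of three states.

    `low` and `high` are assumed thresholds: values below `low` count as
    no motion, values above `high` count as clear motion, and anything
    in between is flagged as uncertain instead of being forced into a
    deterministic label.
    """
    m = abs(magnitude)
    if m < low:
        return ActionState.IDLE
    if m > high:
        return ActionState.ACTIVE
    return ActionState.UNCERTAIN


def label_trajectory(magnitudes: Iterable[float],
                     low: float = 0.1,
                     high: float = 0.5) -> List[ActionState]:
    """Label every frame of a (possibly noisy) motion trajectory."""
    return [label_motion(m, low, high) for m in magnitudes]


labels = label_trajectory([0.02, 0.3, 0.9, -0.8])
```

Under this assumed scheme, frames labeled `UNCERTAIN` can still be used as video training data while being kept out of the deterministic action classes, which matches the stated goal of using raw video without letting noisy trajectories corrupt the action space.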

📝 Abstract

We propose Infinite-World, a robust interactive world model capable of maintaining coherent visual memory over 1000+ frames in complex real-world environments. While existing world models can be efficiently optimized on synthetic data with perfect ground truth, they lack an effective training paradigm for real-world videos due to noisy pose estimates and the scarcity of viewpoint revisits. To bridge this gap, we first introduce a Hierarchical Pose-free Memory Compressor (HPMC) that recursively distills historical latents into a fixed-budget representation. By jointly optimizing the compressor with the generative backbone, HPMC enables the model to autonomously anchor generations in the distant past with bounded computational cost, eliminating the need for explicit geometric priors. Second, we propose an Uncertainty-aware Action Labeling module that discretizes continuous motion into a tri-state logic. This strategy maximizes the utilization of raw video data while shielding the deterministic action space from being corrupted by noisy trajectories, ensuring robust action-response learning. Furthermore, guided by insights from a pilot toy study, we employ a Revisit-Dense Finetuning Strategy using a compact, 30-minute dataset to efficiently activate the model's long-range loop-closure capabilities. Extensive experiments, including objective metrics and user studies, demonstrate that Infinite-World achieves superior performance in visual quality, action controllability, and spatial consistency.
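The idea of recursively distilling history into a fixed-budget memory can be sketched with a toy compressor. In the example below, plain average pooling stands in for the learned compressor that the abstract describes as jointly trained with the generative backbone; the function names (`pool`, `update_memory`), the chunk size, and the budget of 8 slots are all assumptions made for illustration.

```python
from typing import List

Latent = List[float]


def pool(latents: List[Latent], budget: int) -> List[Latent]:
    """Average-pool a temporally ordered list of latents to at most `budget` entries.

    This is a stand-in for a learned compressor: adjacent latents are
    grouped into `budget` contiguous groups and each group is averaged.
    """
    n = len(latents)
    if n <= budget:
        return [list(v) for v in latents]
    pooled = []
    for i in range(budget):
        # Contiguous group boundaries; each group is non-empty because n > budget.
        group = latents[i * n // budget:(i + 1) * n // budget]
        dim = len(group[0])
        pooled.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return pooled


def update_memory(memory: List[Latent], chunk: List[Latent], budget: int) -> List[Latent]:
    """Fold a new chunk of latents into memory, then re-compress to the budget.

    Because the previous memory is itself a compressed summary, older
    content is compressed again on every update, while the memory size
    stays bounded by `budget`.
    """
    return pool(memory + chunk, budget)


# Stream 1000 toy "frame latents" through the memory in chunks of 8.
BUDGET, CHUNK = 8, 8
memory: List[Latent] = []
for start in range(0, 1000, CHUNK):
    chunk = [[float(t), 1.0] for t in range(start, start + CHUNK)]
    memory = update_memory(memory, chunk, BUDGET)
```

In this toy version the memory never exceeds 8 entries regardless of how many frames have been seen, and no camera pose is consulted at any point. Older frames are represented at progressively coarser resolution because they pass through the pooling step more often.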

Problem

Research questions and friction points this paper is trying to address.

world models

long-horizon prediction

pose-free memory

real-world video

action-response learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Pose-Free Memory

Hierarchical Compression

Uncertainty-aware Action Labeling