4Real-Video-V2: Fused View-Time Attention and Feedforward Reconstruction for 4D Scene Generation

📅 2025-06-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing 4D video generation methods suffer from inadequate spatiotemporal modeling, hindering joint optimization of spatial structure and temporal dynamics; meanwhile, 3D reconstruction algorithms lack adaptability to continuous dynamic scenes. This paper introduces the first feed-forward 4D spatiotemporal video generation framework, jointly synthesizing high-fidelity video frames and dynamic 3D Gaussian splatting fields. Our key contributions are: (1) a view-time joint attention mechanism that unifies cross-view and inter-frame dependency modeling within a single attention layer; (2) a sparse attention pattern coupled with camera token replacement to balance computational efficiency and temporal coherence; and (3) an integrated architecture combining a 4D diffusion prior, a differentiable Gaussian rendering head, and a dynamic reconstruction network. Evaluated on standard 4D generation benchmarks, our method achieves state-of-the-art performance, significantly improving both visual fidelity and 3D geometric reconstruction accuracy.

📝 Abstract
We propose the first framework capable of computing a 4D spatio-temporal grid of video frames and 3D Gaussian particles for each time step using a feed-forward architecture. Our architecture has two main components: a 4D video model and a 4D reconstruction model. In the first part, we analyze current 4D video diffusion architectures that perform spatial and temporal attention either sequentially or in parallel within a two-stream design. We highlight the limitations of existing approaches and introduce a novel fused architecture that performs spatial and temporal attention within a single layer. The key to our method is a sparse attention pattern, where tokens attend to others in the same frame, at the same timestamp, or from the same viewpoint. In the second part, we extend existing 3D reconstruction algorithms by introducing a Gaussian head, a camera token replacement algorithm, and additional dynamic layers and training. Overall, we establish a new state of the art for 4D generation, improving both visual quality and reconstruction capability.
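The sparse view-time pattern described in the abstract can be sketched as a boolean attention mask over tokens indexed by (viewpoint, timestamp). The grid sizes and token layout below are illustrative assumptions, not the paper's actual configuration: a token attends to another if they share a timestamp (cross-view) or a viewpoint (cross-time); tokens in the same frame share both, so frame-level attention falls out automatically.

```python
import numpy as np

# Hypothetical token grid: V viewpoints x T timestamps x S spatial tokens per frame.
V, T, S = 3, 4, 2
N = V * T * S

# Assign each flattened token its (view, time) indices.
view = np.repeat(np.arange(V), T * S)          # view id per token
time = np.tile(np.repeat(np.arange(T), S), V)  # time id per token

# Sparse view-time mask: token i may attend to token j iff they share a
# timestamp (same-time, cross-view) or a viewpoint (same-view, cross-time).
# Same-frame attention is the intersection of both conditions.
mask = (time[:, None] == time[None, :]) | (view[:, None] == view[None, :])

# Fraction of the dense N x N attention retained by the sparse pattern.
density = mask.mean()
```

Each row of the mask keeps S·(V + T − 1) of the N entries, so the cost grows with V + T rather than V·T, which is the efficiency argument behind fusing both axes in one attention layer.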
Problem

Research questions and friction points this paper is trying to address.

Existing 4D video diffusion models apply spatial and temporal attention sequentially or in parallel two-stream designs, limiting joint optimization of spatial structure and temporal dynamics
Dense attention across all views and frames is computationally prohibitive, motivating a sparser pattern that preserves coherence
Current 3D reconstruction algorithms are not adapted to continuous dynamic scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fused view-time attention performed within a single attention layer, using a sparse pattern over same-frame, same-timestamp, and same-viewpoint tokens
Feed-forward generation of a 4D spatio-temporal grid of video frames and per-timestep 3D Gaussian particles
Gaussian rendering head, camera token replacement, and dynamic layers extending existing 3D reconstruction