Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors

📅 2026-04-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the geometric distortions and multi-view inconsistencies commonly observed when generating an orbital video from a single image, which arise mainly from long-range viewpoint extrapolation. To overcome these limitations, the authors introduce compact latent features from a 3D foundation generative model into video synthesis, implicitly modeling complete object geometry without explicit mesh reconstruction. The approach jointly conditions diffusion-based video generation on a global latent vector and on view-dependent latent image features, using a multi-scale 3D adapter that injects these features into the base video model via cross-attention and allows model-agnostic fine-tuning. Experiments on multiple benchmarks show that the method achieves better visual quality, geometric fidelity, and multi-view consistency than existing approaches, and that it generalizes robustly to complex camera trajectories and real-world input images.

📝 Abstract

We present a novel method for generating geometrically realistic and consistent orbital videos from a single image of an object. Existing video generation works mostly rely on pixel-wise attention to enforce view consistency across frames. However, such a mechanism does not impose sufficient constraints for long-range extrapolation, e.g., rear-view synthesis, in which pixel correspondences to the input image are limited. Consequently, these works often fail to produce results with a plausible and coherent structure. To tackle this issue, we propose to leverage rich shape priors from a 3D foundation generative model as an auxiliary constraint, motivated by its capability of modeling realistic object shape distributions learned from large 3D asset corpora. Specifically, we prompt the video generation with two scales of latent features encoded by the 3D foundation model: (i) a denoised global latent vector as overall structural guidance, and (ii) a set of latent images projected from volumetric features to provide view-dependent, fine-grained geometric details. In contrast to commonly used 2.5D representations such as depth or normal maps, these compact features can model complete object shapes and help improve inference efficiency by avoiding explicit mesh extraction. To achieve effective shape conditioning, we introduce a multi-scale 3D adapter that injects feature tokens into the base video model via cross-attention, which retains the model's capabilities from general video pretraining and enables a simple, model-agnostic fine-tuning process. Extensive experiments on multiple benchmarks show that our method achieves superior visual quality, shape realism, and multi-view consistency compared to state-of-the-art methods, and robustly generalizes to complex camera trajectories and in-the-wild images.
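
The cross-attention conditioning described in the abstract can be illustrated with a minimal, dependency-free Python sketch. This is not the authors' implementation: the function name `cross_attention_adapter`, the weight matrices `w_q`, `w_k`, `w_v`, the scalar `gate`, and all sizes are hypothetical. The zero-initialized gate is an assumed, commonly used way to leave a pretrained model unchanged at the start of fine-tuning; the abstract does not say how the adapter is initialized. Video tokens act as queries, while the global latent vector and the view-dependent latent-image tokens are concatenated to form the key/value context.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix, used here only to make toy inputs and weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def cross_attention_adapter(video_tokens, global_latent, view_tokens,
                            w_q, w_k, w_v, gate):
    """Hypothetical adapter step: video tokens attend to 3D latent tokens.

    video_tokens : N x d_model      (queries, from the base video model)
    global_latent: d_ctx            (one coarse, whole-shape token)
    view_tokens  : M x d_ctx        (fine, view-dependent tokens)
    w_q: d_model x d, w_k: d_ctx x d, w_v: d_ctx x d_model
    gate         : scalar; 0.0 leaves the video tokens untouched (assumption)
    """
    # Two scales of conditioning are merged into one context sequence.
    context = [global_latent] + view_tokens            # (1 + M) x d_ctx
    q = matmul(video_tokens, w_q)                      # N x d
    k = matmul(context, w_k)                           # (1 + M) x d
    v = matmul(context, w_v)                           # (1 + M) x d_model
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]               # d x (1 + M)
    scores = matmul(q, k_t)                            # N x (1 + M)
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    update = matmul(attn, v)                           # N x d_model
    # Gated residual: the base features plus the scaled 3D-conditioned update.
    return [[x + gate * u for x, u in zip(x_row, u_row)]
            for x_row, u_row in zip(video_tokens, update)]


# Toy usage with made-up sizes: 4 video tokens, 3 view tokens.
rng = random.Random(0)
d_model, d_ctx, d_head = 8, 6, 5
video = rand_matrix(4, d_model, rng)
global_vec = rand_matrix(1, d_ctx, rng)[0]
views = rand_matrix(3, d_ctx, rng)
w_q = rand_matrix(d_model, d_head, rng)
w_k = rand_matrix(d_ctx, d_head, rng)
w_v = rand_matrix(d_ctx, d_model, rng)
conditioned = cross_attention_adapter(video, global_vec, views, w_q, w_k, w_v, gate=1.0)
print(len(conditioned), len(conditioned[0]))
```

With `gate=0.0` the function returns the video tokens unchanged, which is one plausible reading of how an adapter can retain the base model's pretrained behavior; during fine-tuning the gate and projection weights would be learned.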

Problem

Research questions and friction points this paper is trying to address.

orbital video generation

view consistency

3D shape prior

single-image synthesis

geometric realism

Innovation

Methods, ideas, or system contributions that make the work stand out.

3D foundation model

orbital video generation

shape prior