Zero-shot 3D-Aware Trajectory-Guided Image-to-Video Generation via Test-Time Training

πŸ“… 2025-09-08
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing zero-shot trajectory-guided image-to-video (I2V) methods suffer from three key limitations: reliance on costly fine-tuning, neglect of 3D geometry leading to motion distortions, and inconsistency between latent-space manipulation and noise prediction. This paper introduces the first zero-shot, 3D-aware I2V framework that requires no fine-tuning. Its core contributions are: (1) a depth-estimated, 3D-aware motion projection that maps trajectories under perspective constraints; (2) test-time dynamic LoRA injection and optimization to align motion guidance with the diffusion process; and (3) a single-step lookahead guidance field correction mechanism ensuring consistency between latent operations and noise prediction. Evaluated on multiple benchmarks, our method significantly outperforms both training-based and zero-shot prior approaches, achieving state-of-the-art performance in both motion accuracy and 3D plausibility.

πŸ“ Abstract
Trajectory-guided image-to-video (I2V) generation aims to synthesize videos that adhere to user-specified motion instructions. Existing methods typically rely on computationally expensive fine-tuning on scarce annotated datasets. Although some zero-shot methods attempt trajectory control in the latent space, they may yield unrealistic motion by neglecting 3D perspective and creating a misalignment between the manipulated latents and the network's noise predictions. To address these challenges, we introduce Zo3T, a novel zero-shot test-time-training framework for trajectory-guided generation with three core innovations: First, we incorporate a 3D-Aware Kinematic Projection, leveraging inferred scene depth to derive perspective-correct affine transformations for target regions. Second, we introduce Trajectory-Guided Test-Time LoRA, a mechanism that dynamically injects ephemeral LoRA adapters into the denoising network and optimizes them alongside the latent state. Driven by a regional feature consistency loss, this co-adaptation effectively enforces motion constraints while allowing the pre-trained model to locally adapt its internal representations to the manipulated latent, thereby ensuring generative fidelity and on-manifold adherence. Finally, we develop Guidance Field Rectification, which refines the denoising trajectory by optimizing the conditional guidance field through a one-step lookahead strategy, ensuring efficient generative progression towards the target trajectory. Zo3T significantly enhances 3D realism and motion accuracy in trajectory-controlled I2V generation, demonstrating superior performance over existing training-based and zero-shot approaches.
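The geometric idea behind the 3D-Aware Kinematic Projection can be illustrated with a toy sketch. This is not the paper's implementation (which operates on diffusion latents); it only shows how an estimated depth map yields a perspective-correct affine transform for a dragged region, under a simple pinhole-camera assumption. All function and variable names here are our own.

```python
import numpy as np

def perspective_affine(region_center, target_center, depth_map):
    """Toy sketch: perspective-correct 2x3 affine for dragging a region.

    Assumption (ours): under a pinhole camera, an object's apparent size
    scales inversely with its depth, so moving a region from depth z_src
    to depth z_dst rescales it by z_src / z_dst.
    """
    h, w = depth_map.shape

    def depth_at(p):
        x = int(np.clip(p[0], 0, w - 1))
        y = int(np.clip(p[1], 0, h - 1))
        return depth_map[y, x]

    z_src = depth_at(region_center)
    z_dst = depth_at(target_center)
    s = z_src / max(z_dst, 1e-6)  # perspective scale factor

    # Translation chosen so the region center maps exactly onto the target.
    tx = target_center[0] - s * region_center[0]
    ty = target_center[1] - s * region_center[1]
    return np.array([[s, 0.0, tx],
                     [0.0, s, ty]])
```

A trajectory that drags a region toward larger depth values thus produces a transform with scale below 1, shrinking the region as it recedes, instead of the depth-agnostic pure translation used by 2D latent-dragging methods.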
Problem

Research questions and friction points this paper is trying to address.

Zero-shot 3D-aware trajectory-guided video generation
Addressing unrealistic motion from latent space manipulation
Ensuring generative fidelity with on-manifold adherence
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D-Aware Kinematic Projection for perspective-correct transformations
Trajectory-Guided Test-Time LoRA adapters for dynamic optimization
Guidance Field Rectification refining denoising evolutionary path
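The one-step lookahead behind Guidance Field Rectification can be sketched with a linearized toy sampler. This is an illustrative stand-in, not the paper's method: we replace the real diffusion update with `x_prev = a*x_t - b*eps` and optimize an additive correction on the noise prediction so that the lookahead latent matches a target in a masked region. All names and the constants are our assumptions.

```python
import numpy as np

def rectify_guidance(x_t, eps, target, mask, a=0.98, b=0.5,
                     lr=1.0, steps=60):
    """Toy one-step-lookahead rectification of a noise prediction.

    Assumptions (ours): a linear one-step update x_prev = a*x_t - b*eps
    stands in for the sampler; `target` holds the desired latent content
    inside the binary `mask`. We gradient-descend an additive correction
    `delta` on eps so the lookahead latent hits the target in the mask.
    """
    delta = np.zeros_like(eps)
    for _ in range(steps):
        x_prev = a * x_t - b * (eps + delta)   # one-step lookahead
        residual = mask * (x_prev - target)    # error in the guided region
        grad = -2.0 * b * residual             # d||residual||^2 / d delta
        delta -= lr * grad
    return eps + delta
```

Because the gradient is zero wherever the mask is zero, the correction only touches the guided region; the lookahead keeps the latent manipulation and the network's noise prediction consistent instead of editing the latent after the fact.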