EIDT-V: Exploiting Intersections in Diffusion Trajectories for Model-Agnostic, Zero-Shot, Training-Free Text-to-Video Generation

📅 2025-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses zero-shot, training-free text-to-video generation without modifying the architecture of existing image diffusion models. Methodologically, it introduces a model-agnostic approach that exploits intersection points in latent diffusion trajectories, coupled with a grid-based temporal control mechanism that balances inter-frame coherence and diversity without architectural changes. Context-aware, in-context-trained LLMs generate frame-level prompts and identify semantic differences between frames, and a CLIP-based attention mask derived from those differences schedules when each grid cell switches prompts. Key technical components include diffusion-trajectory analysis, grid-based spatiotemporal control, in-context LLM prompt generation, and CLIP-driven prompt scheduling. Quantitative evaluations and user studies show state-of-the-art performance, with clear gains in temporal coherence, visual fidelity, and subjective quality. The framework is plug-and-play compatible with mainstream image diffusion models.
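The trajectory-intersection idea can be pictured with a short sketch. The following is a minimal illustration of one plausible reading of the summary, not the authors' code: all frames share a single denoising trajectory under a base prompt and branch onto frame-specific prompts at a chosen step. `eps_model`, the toy DDIM schedule, and all tensor shapes are stand-in assumptions.

```python
import torch

# `eps_model` stands in for any image diffusion model's noise predictor and is
# dummied out so the sketch runs; the schedule and shapes are toy values.
eps_model = lambda z, s, emb: torch.randn_like(z)

num_steps, switch_step, num_frames = 50, 30, 8
# Toy alpha-bar schedule: s = 0 is the noisiest step, s = num_steps the cleanest.
alpha_bar = torch.linspace(0.02, 0.9999, num_steps + 1)

def ddim_step(z, eps, s):
    """Deterministic DDIM update from step s to s + 1 under the toy schedule."""
    a, a_next = alpha_bar[s], alpha_bar[s + 1]
    x0 = (z - (1 - a).sqrt() * eps) / a.sqrt()   # predicted clean latent
    return a_next.sqrt() * x0 + (1 - a_next).sqrt() * eps

base_emb = torch.randn(1, 77, 768)                # stand-in base-prompt embedding
frame_embs = [torch.randn(1, 77, 768) for _ in range(num_frames)]  # per-frame prompts

z = torch.randn(1, 4, 64, 64)        # one shared initial latent for all frames
for s in range(switch_step):         # shared segment: trajectories coincide here
    z = ddim_step(z, eps_model(z, s, base_emb), s)

frames = []
for emb in frame_embs:               # branch point: each frame finishes on its own prompt
    zf = z.clone()
    for s in range(switch_step, num_steps):
        zf = ddim_step(zf, eps_model(zf, s, emb), s)
    frames.append(zf)
# An earlier switch_step yields more inter-frame variance; a later one, more coherence.
```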

📝 Abstract
Zero-shot, training-free, image-based text-to-video generation is an emerging area that aims to generate videos using existing image-based diffusion models. Current methods in this space require specific architectural changes to image generation models, which limit their adaptability and scalability. In contrast, we provide a model-agnostic approach that uses intersections in diffusion trajectories and works only with the latent values. Trajectory intersections alone, however, cannot provide localized, frame-wise control over coherence and diversity, so we combine them with a grid-based approach. An in-context trained LLM generates coherent frame-wise prompts; another identifies differences between frames. From these, we derive a CLIP-based attention mask that controls when the prompt switches for each grid cell. Earlier switching yields higher variance, while later switching yields more coherence, so our approach can balance coherence and variance across the frames. The result is state-of-the-art performance with greater flexibility across diverse image-generation models. Empirical analysis using quantitative metrics and user studies confirms our model's superior temporal consistency, visual fidelity, and user satisfaction, providing a novel route to training-free, image-based text-to-video generation.
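The CLIP-based attention mask described in the abstract can be approximated as follows. This is a hedged sketch of one way to score grid cells against the inter-frame difference text using the Hugging Face `transformers` CLIP API; the patch-token projection and the mapping from scores to switch steps are my assumptions, not the paper's exact procedure.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Score each image patch against a text snippet describing what changes
# between consecutive frames, then turn the scores into a per-grid-cell
# prompt-switch schedule. The frame path and difference text are hypothetical.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("frame.png").convert("RGB")   # hypothetical current frame
diff_text = "the bird raises its wings"          # hypothetical inter-frame difference

inputs = processor(text=[diff_text], images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    vis = model.vision_model(pixel_values=inputs["pixel_values"])
    # Project the 49 patch tokens (a 7x7 grid for ViT-B/32 at 224 px) into the
    # shared CLIP space. Projecting patch tokens this way is a common
    # approximation; CLIP normally projects only the pooled [CLS] token.
    tokens = model.vision_model.post_layernorm(vis.last_hidden_state[:, 1:, :])
    patches = model.visual_projection(tokens)

sim = torch.cosine_similarity(patches, text_emb[:, None, :], dim=-1)  # (1, 49)
grid = sim.reshape(7, 7)
grid = (grid - grid.min()) / (grid.max() - grid.min() + 1e-8)  # normalize to [0, 1]

# Cells most relevant to the described change switch prompts earliest (more
# variance where motion happens); unrelated cells switch late (more coherence).
num_steps = 50
switch_step = (num_steps * (1.0 - grid)).long()  # per-cell switch timestep
```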
Problem

Research questions and friction points this paper is trying to address.

Achieving zero-shot text-to-video generation without model retraining
Ensuring frame coherence and diversity in diffusion-based video synthesis
Providing model-agnostic flexibility for diverse image-generation frameworks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model-agnostic diffusion trajectory intersections
Grid-based approach for frame coherence
CLIP-based attention mask for per-cell prompt switching (see the sketch after this list)
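Putting these pieces together, a toy version of per-grid-cell prompt switching might look like the following. The two-branch noise blending and the placeholder update step are simplifications I am assuming for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

eps_model = lambda z, s, emb: torch.randn_like(z)   # stand-in noise predictor

num_steps = 50
# Per-cell switch schedule; in the paper's pipeline this would come from the
# CLIP-based attention mask, here it is randomized so the sketch runs.
switch_step = torch.randint(10, 40, (7, 7))
old_emb = torch.randn(1, 77, 768)   # previous frame's prompt embedding (stand-in)
new_emb = torch.randn(1, 77, 768)   # current frame's prompt embedding (stand-in)

z = torch.randn(1, 4, 64, 64)
for s in range(num_steps):
    eps_old = eps_model(z, s, old_emb)
    eps_new = eps_model(z, s, new_emb)
    # Cells whose scheduled step has passed follow the new prompt's prediction;
    # the 7x7 schedule is upsampled to the latent resolution.
    switched = (switch_step <= s).float()[None, None]          # (1, 1, 7, 7)
    mask = F.interpolate(switched, size=z.shape[-2:], mode="nearest")
    eps = mask * eps_new + (1 - mask) * eps_old
    z = z - 0.02 * eps  # placeholder update; a real sampler (e.g. DDIM) step goes here
```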
Authors

Diljeet Jagpal (University of Bath)
Xi Chen (University of Bath; Fudan University)
Vinay P. Namboodiri (Department of Computer Science, University of Bath)
Computer Vision · Image Processing · Machine Learning