AI Summary
Existing text-to-video (T2V) diffusion models struggle to accurately model dynamic multi-object interactions, often misclassifying objects as static background and suffering from object omission, misalignment, or feature entanglement. This work proposes a training-free, multi-object, text-driven video generation framework. We introduce a large language model (LLM) as a "director" that explicitly plans spatiotemporal trajectories for each object; integrate noise re-initialization with attention map editing to enable object-level motion control and feature disentanglement; and leverage open-world knowledge to enhance semantic consistency. While preserving high visual fidelity and motion smoothness, our method improves multi-object motion dynamics and generation accuracy by 42 percentage points over prior approaches. To the best of our knowledge, it is the first framework to enable prompt-driven, fine-grained, and editable multi-agent video synthesis.
Abstract
Recent advances in diffusion-based text-to-video (T2V) models have demonstrated remarkable progress, but these models still face challenges in generating videos with multiple objects. Most models struggle to accurately capture complex object interactions, often treating some objects as static background elements and limiting their movement. In addition, they often fail to generate multiple distinct objects as specified in the prompt, resulting in incorrect generations or mixed features across objects. In this paper, we present a novel training-free approach to multi-object video generation that leverages the open-world knowledge of diffusion models and large language models (LLMs). We use an LLM as the "director" of object trajectories, and apply the trajectories through noise re-initialization to achieve precise control of realistic movements. We further refine the generation process by manipulating the attention mechanism to better capture object-specific features and motion patterns and to prevent cross-object feature interference. Extensive experiments validate the effectiveness of our training-free approach in significantly enhancing the multi-object generation capabilities of existing video diffusion models, yielding a 42% absolute improvement in motion dynamics and object generation accuracy while maintaining high fidelity and motion smoothness.
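The core idea of trajectory-guided noise re-initialization can be illustrated with a minimal sketch. The paper does not specify its exact procedure, so the function below is a hypothetical simplification: given per-frame bounding boxes planned by the LLM "director", the same initial-noise patch is re-pasted at each frame's box, so the noise that seeds an object's features travels along the planned trajectory before denoising begins.

```python
import numpy as np

def reinit_noise_with_trajectory(base_noise, boxes):
    """Shift a shared noise patch along an LLM-planned trajectory.

    base_noise: (T, H, W) array of per-frame initial Gaussian noise.
    boxes: list of (y, x, h, w) boxes, one per frame, as planned by the
        LLM "director" (hypothetical interface, for illustration only).
    The patch under frame 0's box is copied into every frame's box, so the
    same noise, and loosely the same object appearance, follows the path.
    """
    noise = base_noise.copy()
    y0, x0, h, w = boxes[0]
    patch = noise[0, y0:y0 + h, x0:x0 + w]
    for t, (y, x, bh, bw) in enumerate(boxes):
        noise[t, y:y + bh, x:x + bw] = patch[:bh, :bw]
    return noise

rng = np.random.default_rng(0)
T, H, W = 4, 16, 16
base = rng.standard_normal((T, H, W))
# One object moving 2 px to the right per frame, per the planned trajectory.
boxes = [(4, 2 + 2 * t, 4, 4) for t in range(T)]
out = reinit_noise_with_trajectory(base, boxes)
```

In a real T2V pipeline this correlated noise tensor would replace the i.i.d. latent noise fed to the diffusion sampler, and the attention-editing step would additionally mask cross-attention so each object's tokens attend only to its own region.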