AI Summary
Existing text-to-video (T2V) diffusion models struggle to accurately model dynamic multi-object interactions, often misclassifying objects as static background and suffering from object omission, misalignment, or feature entanglement. This work proposes a training-free, multi-object, text-driven video generation framework. We introduce a large language model (LLM) as a "director" that explicitly plans spatiotemporal trajectories for each object; integrate noise re-initialization with attention map editing to enable object-level motion control and feature disentanglement; and leverage open-world knowledge to enhance semantic consistency. While preserving high visual fidelity and motion smoothness, our method improves multi-object motion dynamics and generation accuracy by 42 percentage points over prior approaches. To the best of our knowledge, it is the first framework to enable prompt-driven, fine-grained, and editable multi-agent video synthesis.
Abstract
Recent advances in diffusion-based text-to-video (T2V) models have demonstrated remarkable progress, but these models still face challenges in generating videos with multiple objects. Most models struggle to accurately capture complex object interactions, often treating some objects as static background elements and limiting their movement. In addition, they often fail to generate multiple distinct objects as specified in the prompt, resulting in incorrect generations or mixed features across objects. In this paper, we present a novel training-free approach to multi-object video generation that leverages the open-world knowledge of diffusion models and large language models (LLMs). We use an LLM as the "director" of object trajectories, and apply the trajectories through noise re-initialization to achieve precise control of realistic movements. We further refine the generation process by manipulating the attention mechanism to better capture object-specific features and motion patterns and to prevent cross-object feature interference. Extensive experiments validate the effectiveness of our training-free approach in significantly enhancing the multi-object generation capabilities of existing video diffusion models, yielding a 42% absolute improvement in motion dynamics and object generation accuracy while maintaining high fidelity and motion smoothness.
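The core idea of trajectory-guided noise re-initialization can be illustrated with a minimal sketch. The paper does not specify its exact procedure, so the function below is a hypothetical simplification: given per-frame bounding boxes planned by the LLM "director", the same initial-noise patch is re-pasted at each frame's box, so the noise that seeds an object's features travels along the planned trajectory before denoising begins.

```python
import numpy as np

def reinit_noise_with_trajectory(base_noise, boxes):
    """Shift a shared noise patch along an LLM-planned trajectory.

    base_noise: (T, H, W) array of per-frame initial Gaussian noise.
    boxes: list of (y, x, h, w) boxes, one per frame, as planned by the
        LLM "director" (hypothetical interface, for illustration only).
    The patch under frame 0's box is copied into every frame's box, so the
    same noise, and loosely the same object appearance, follows the path.
    """
    noise = base_noise.copy()
    y0, x0, h, w = boxes[0]
    patch = noise[0, y0:y0 + h, x0:x0 + w]
    for t, (y, x, bh, bw) in enumerate(boxes):
        noise[t, y:y + bh, x:x + bw] = patch[:bh, :bw]
    return noise

rng = np.random.default_rng(0)
T, H, W = 4, 16, 16
base = rng.standard_normal((T, H, W))
# One object moving 2 px to the right per frame, per the planned trajectory.
boxes = [(4, 2 + 2 * t, 4, 4) for t in range(T)]
out = reinit_noise_with_trajectory(base, boxes)
```

In a real T2V pipeline this correlated noise tensor would replace the i.i.d. latent noise fed to the diffusion sampler, and the attention-editing step would additionally mask cross-attention so each object's tokens attend only to its own region.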