🤖 AI Summary
Text-to-video diffusion models often generate videos that do not match the prompt when it specifies fine-grained compositional semantics such as entity relations, attributes, actions, and motion directions. This work proposes Compositional Video Guidance (CVG), an inference-time guidance method for frozen text-to-video models. A lightweight compositional classifier is trained on the generator's cross-attention maps, which encode how prompt concepts are grounded across space and time; at inference, the classifier's gradients steer the latents during the early denoising steps. The method requires no architectural changes, no fine-tuning of the generator, and no layouts, boxes, or other user-supplied controls. Because the classifier is built on a frozen vision-language model (VLM) backbone, it transfers across semantically related composition labels rather than relying only on narrow category-specific features. Experiments on compositional text-to-video benchmarks show improved prompt faithfulness while preserving the visual quality of the underlying generator.
📝 Abstract
Text-to-video diffusion models generate realistic videos, but often fail on prompts requiring fine-grained compositional understanding, such as relations between entities, attributes, actions, and motion directions. We hypothesize that these failures need not be addressed by retraining the generator, but can instead be mitigated by steering the denoising process using the model's own internal grounding signals. We propose CVG, an inference-time guidance method for improving compositional faithfulness in frozen text-to-video models. Our key observation is that cross-attention maps already encode how prompt concepts are grounded across space and time. We train a lightweight compositional classifier on these attention features and use its gradients during early denoising steps to steer the latent trajectory toward the desired composition. Built on a frozen VLM backbone, the classifier transfers across semantically related composition labels rather than relying only on narrow category-specific features. CVG improves compositional generation without modifying the model architecture, fine-tuning the generator, or requiring layouts, boxes, or other user-supplied controls. Experiments on compositional text-to-video benchmarks show improved prompt faithfulness while preserving the visual quality of the underlying generator.
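The guidance loop described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation, and everything in it is an assumption made for illustration: the "latent" is a tiny frames-by-positions array, `attention_maps` is a softmax stand-in for real cross-attention, the classifier is a hand-written analytic score for the label "the object moves left to right" rather than a trained VLM-based model, and `toy_denoise_step` is a placeholder for the frozen generator. Only the overall pattern follows the text: compute the classifier's gradient on attention features and add it to the latent during the early steps only.

```python
import numpy as np

def attention_maps(z):
    """Toy stand-in for cross-attention: per-frame softmax over horizontal positions."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def log_prob_and_grad(z, sharpness=10.0):
    """Hypothetical classifier for 'the object moves left to right'.

    Returns log p(label | attention(z)) and its gradient with respect to z.
    """
    x = np.linspace(0.0, 1.0, z.shape[1])      # horizontal coordinates
    attn = attention_maps(z)                    # shape (frames, positions)
    centroid = attn @ x                         # attention centroid per frame
    logit = sharpness * (centroid[-1] - centroid[0])
    log_p = -np.logaddexp(0.0, -logit)          # log sigmoid(logit)
    dlogit = 1.0 - np.exp(log_p)                # d log sigmoid / d logit
    grad = np.zeros_like(z)
    # Softmax rule: d centroid_t / d z[t, w] = attn[t, w] * (x[w] - centroid_t)
    grad[-1] = dlogit * sharpness * attn[-1] * (x - centroid[-1])
    grad[0] = -dlogit * sharpness * attn[0] * (x - centroid[0])
    return log_p, grad

def toy_denoise_step(z):
    """Placeholder for one step of the frozen generator (not a real sampler)."""
    return 0.95 * z

def sample(guided, steps=20, guide_steps=8, scale=5.0, seed=0):
    """Run the toy loop; guidance is applied only in the first `guide_steps` steps."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((4, 16))            # 4 frames, 16 positions
    for i in range(steps):
        if guided and i < guide_steps:
            _, grad = log_prob_and_grad(z)
            z = z + scale * grad                # steer toward the target label
        z = toy_denoise_step(z)
    return float(np.exp(log_prob_and_grad(z)[0]))
```

Calling `sample(guided=True)` and `sample(guided=False)` with the same seed compares the toy classifier's final probability with and without early-step guidance. In a real system the gradient would come from backpropagating through the trained classifier and the generator's attention layers, and the step count and guidance scale (here arbitrary) would need tuning.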