🤖 AI Summary
This work addresses the challenge of enhancing motion dynamics in text-to-video generation without the semantic drift and object distortion that explicit negative prompts commonly cause. The authors propose an implicit contrastive mechanism that builds local negative anchors by injecting Gaussian noise into concept embeddings, paired with a piecewise classifier-free guidance (CFG) schedule that intervenes only during the early denoising steps. Because no explicit negative prompt is needed, the approach avoids the associated semantic bias, and the same mechanism also helps steer complex non-linear attributes such as motion trajectories and object counts. Evaluated across several state-of-the-art text-to-video frameworks, the method consistently improves motion dynamics with negligible computational overhead and minimal loss of visual quality.
📝 Abstract
Despite recent advances in Text-to-Video (T2V) synthesis, generating high-fidelity and dynamic motion remains a significant challenge. Existing methods primarily rely on Classifier-Free Guidance (CFG), often with explicit negative prompts (e.g., "static", "blurry"), to suppress undesired artifacts. However, such explicit negations frequently introduce unintended semantic bias and distort object integrity, a phenomenon we define as Content-Motion Drift. To address this, we propose MotionCFG, a framework that enhances motion dynamics by contrasting a target concept with its noise-perturbed counterparts. Specifically, by injecting Gaussian noise into the concept embeddings, MotionCFG creates localized negative anchors that encapsulate a broad complementary space of sub-optimal motion variations. Unlike explicit negations, this approach facilitates implicit hard negative mining without shifting the global semantic identity, allowing for a focused refinement of temporal details. Combined with a piecewise guidance schedule that confines intervention to the early denoising steps, MotionCFG consistently improves motion dynamics across state-of-the-art T2V frameworks with negligible computational overhead and minimal compromise in visual quality. Additionally, we demonstrate that this noise-induced contrastive mechanism is effective not only for sharpening motion trajectories but also for steering complex, non-linear concepts such as precise object numerosity, which are typically difficult to modulate via standard text-based guidance.
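The mechanism described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names (`noisy_negative_embedding`, `guidance_scale`, `guided_prediction`), the parameters `sigma`, `scale`, and `early_fraction`, the particular extrapolation formula, and the fallback to the plain conditional prediction outside the early window are all assumptions made for illustration, since the abstract does not give these details.

```python
import random
from typing import Callable, List

Vector = List[float]
# Hypothetical denoiser signature: (latent, step, text_embedding) -> prediction
Denoiser = Callable[[Vector, int, Vector], Vector]


def noisy_negative_embedding(concept_emb: Vector, sigma: float,
                             rng: random.Random) -> Vector:
    """Build a local negative anchor by adding Gaussian noise to the embedding."""
    return [x + sigma * rng.gauss(0.0, 1.0) for x in concept_emb]


def guidance_scale(step: int, num_steps: int, scale: float,
                   early_fraction: float) -> float:
    """Piecewise schedule: guide only during the first `early_fraction` of steps."""
    return scale if step < early_fraction * num_steps else 0.0


def guided_prediction(denoiser: Denoiser, latent: Vector, step: int,
                      concept_emb: Vector, neg_emb: Vector, w: float) -> Vector:
    """CFG-style extrapolation away from the noise-perturbed anchor.

    Assumption: with w == 0 the sketch returns the conditional prediction
    unchanged, so the negative branch is skipped in later steps.
    """
    pos = denoiser(latent, step, concept_emb)
    if w == 0.0:
        return pos
    neg = denoiser(latent, step, neg_emb)
    return [p + w * (p - n) for p, n in zip(pos, neg)]
```

In a sampling loop, one would call `guidance_scale` at each step and pass the result as `w`. Under these assumptions, the second denoiser call is only paid for during the early window, which is consistent with the abstract's claim of low overhead, although the actual cost profile depends on the authors' implementation.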