🤖 AI Summary
Existing text-to-video generation methods achieve notable progress in visual fidelity but offer limited control over subject composition in complex, multi-object scenes, in particular lacking explicit semantic correspondence between object trajectories and visual entities. To address this, the authors propose Text-Grounded Trajectories (TGT), a framework that pairs point-level motion trajectories with localized textual descriptions to enable fine-grained control over both appearance and dynamics. Key contributions: (1) Location-Aware Cross-Attention (LACA), which jointly models trajectory–text–pixel relationships; (2) a dual classifier-free guidance (dual-CFG) scheme that separately modulates local and global text guidance; and (3) a fully automated annotation pipeline used to label two million high-quality video clips for training. Extensive experiments demonstrate that TGT significantly outperforms state-of-the-art methods in visual quality, text–video alignment, and controllability of multi-object motion.
📝 Abstract
Text-to-video generation has advanced rapidly in visual fidelity, but standard methods still have limited ability to control the subject composition of generated scenes. Prior work shows that adding localized control signals, such as bounding boxes or segmentation masks, can help. However, these methods struggle in complex scenarios and degrade in multi-object settings, offering limited precision and lacking a clear correspondence between individual trajectories and visual entities as the number of controllable objects increases. We introduce Text-Grounded Trajectories (TGT), a framework that conditions video generation on point trajectories paired with localized text descriptions. We propose Location-Aware Cross-Attention (LACA) to integrate these signals and adopt a dual-CFG scheme to separately modulate local and global text guidance. In addition, we develop a data-processing pipeline that produces trajectories with localized descriptions of tracked entities, and we annotate two million high-quality video clips to train TGT. Together, these components enable TGT to use point trajectories as intuitive motion handles, pairing each trajectory with text to control both appearance and motion. Extensive experiments show that TGT achieves higher visual quality, more accurate text alignment, and improved motion controllability compared with prior approaches. Website: https://textgroundedtraj.github.io.
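The abstract does not spell out the dual-CFG scheme, but a common way to modulate two text conditions separately is to extend classifier-free guidance with one weight per conditional branch. The sketch below is an assumption, not the paper's implementation; the names `eps_uncond`, `eps_global`, `eps_local`, `w_global`, and `w_local` are hypothetical.

```python
import numpy as np

def dual_cfg(eps_uncond, eps_global, eps_local, w_global=7.5, w_local=3.0):
    """Sketch of a dual classifier-free guidance combination (assumed form).

    Each conditional denoiser prediction (global caption vs. localized,
    trajectory-grounded text) is weighted independently relative to the
    unconditional prediction, so local and global guidance strength can
    be tuned separately.
    """
    return (eps_uncond
            + w_global * (eps_global - eps_uncond)
            + w_local * (eps_local - eps_uncond))

# Toy stand-ins for model noise predictions at one denoising step:
eps_u = np.zeros(4)
eps_g = np.ones(4)
eps_l = 2 * np.ones(4)
print(dual_cfg(eps_u, eps_g, eps_l, w_global=1.0, w_local=0.5))  # [2. 2. 2. 2.]
```

Setting either weight to zero recovers single-condition CFG on the other branch, which is one plausible reason to separate the two guidance terms.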