🤖 AI Summary
This work addresses the challenge of precisely controlling object dynamics and camera trajectories in video generation via natural language—a longstanding limitation in semantic video synthesis. To this end, we propose the first end-to-end language-driven framework for joint 3D object–camera motion generation. Methodologically, we design a domain-specific language (DSL) tailored to cinematic motion, leveraging large language models and program synthesis to automatically parse natural language descriptions into structured, executable 3D trajectory programs. We further introduce the first large-scale text–program–trajectory triplet dataset to support training and evaluation. Compared to prior approaches, our method significantly improves motion controllability and alignment with user intent, while preserving high-fidelity 3D motion planning. Crucially, it offers strong interpretability and post-hoc editability of generated motions. This establishes a novel paradigm for semantic, film-grade video generation.
📝 Abstract
Video generation has achieved remarkable progress in visual fidelity and controllability, enabling conditioning on text, layout, or motion. Among these, motion control (specifying object dynamics and camera trajectories) is essential for composing complex, cinematic scenes, yet existing interfaces remain limited. We introduce LAMP, a framework that leverages large language models (LLMs) as motion planners to translate natural language descriptions into explicit 3D trajectories for dynamic objects and (relatively defined) cameras. LAMP defines a motion domain-specific language (DSL) inspired by cinematography conventions. By harnessing the program synthesis capabilities of LLMs, LAMP generates structured motion programs from natural language, which are deterministically mapped to 3D trajectories. We construct a large-scale procedural dataset pairing natural text descriptions with corresponding motion programs and 3D trajectories. Experiments demonstrate LAMP's improved motion controllability and alignment with user intent compared to state-of-the-art alternatives, establishing the first framework for generating both object and camera motions directly from natural language specifications.
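To make the "DSL program deterministically mapped to 3D trajectories" idea concrete, here is a minimal sketch of what such a pipeline could look like. The LAMP DSL itself is not reproduced here, so the command names (`orbit`, `dolly_in`), their parameters, and the two-command program are illustrative assumptions, not the paper's actual interface.

```python
import math

# Hypothetical mini-DSL for cinematic camera motion (illustrative only;
# command names and semantics are assumptions, not LAMP's actual DSL).

def orbit(center, radius, height, degrees, n_frames):
    """Deterministically sample a camera orbit around `center` as (x, y, z) positions."""
    positions = []
    for i in range(n_frames):
        theta = math.radians(degrees) * i / max(n_frames - 1, 1)
        positions.append((center[0] + radius * math.cos(theta),
                          center[1] + radius * math.sin(theta),
                          center[2] + height))
    return positions

def dolly_in(start, target, distance, n_frames):
    """Move the camera from `start` toward `target` by `distance` units."""
    dx, dy, dz = (t - s for s, t in zip(start, target))
    norm = math.sqrt(dx * dx + dy * dy + dz * dz)
    ux, uy, uz = dx / norm, dy / norm, dz / norm
    return [(start[0] + ux * distance * i / max(n_frames - 1, 1),
             start[1] + uy * distance * i / max(n_frames - 1, 1),
             start[2] + uz * distance * i / max(n_frames - 1, 1))
            for i in range(n_frames)]

# "Orbit the subject 90 degrees, then dolly in" expressed as a two-command program:
trajectory = orbit(center=(0, 0, 0), radius=5, height=2, degrees=90, n_frames=24)
trajectory += dolly_in(start=trajectory[-1], target=(0, 0, 0), distance=3, n_frames=24)
print(len(trajectory))  # 48 camera positions
```

Because each command is a pure function of its arguments, the same program always yields the same trajectory, which is what gives the approach its interpretability and post-hoc editability: a user can tweak `degrees` or `distance` and re-run, rather than re-prompting a black-box generator.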