🤖 AI Summary
This work addresses the challenge of precisely controlling object dynamics and camera trajectories in video generation via natural language—a longstanding limitation in semantic video synthesis. To this end, we propose the first end-to-end language-driven framework for joint 3D object–camera motion generation. Methodologically, we design a domain-specific language (DSL) tailored to cinematic motion, leveraging large language models and program synthesis to automatically parse natural language descriptions into structured, executable 3D trajectory programs. We further introduce the first large-scale text–program–trajectory triplet dataset to support training and evaluation. Compared to prior approaches, our method significantly improves motion controllability and alignment with user intent, while preserving high-fidelity 3D motion planning. Crucially, it offers strong interpretability and post-hoc editability of generated motions. This establishes a novel paradigm for semantic, film-grade video generation.
📝 Abstract
Video generation has achieved remarkable progress in visual fidelity and controllability, enabling conditioning on text, layout, or motion. Among these, motion control (specifying object dynamics and camera trajectories) is essential for composing complex, cinematic scenes, yet existing interfaces remain limited. We introduce LAMP, a framework that leverages large language models (LLMs) as motion planners to translate natural language descriptions into explicit 3D trajectories for dynamic objects and (relatively defined) cameras. LAMP defines a motion domain-specific language (DSL) inspired by cinematography conventions. By harnessing the program synthesis capabilities of LLMs, LAMP generates structured motion programs from natural language, which are deterministically mapped to 3D trajectories. We construct a large-scale procedural dataset pairing natural text descriptions with corresponding motion programs and 3D trajectories. Experiments demonstrate LAMP's improved motion controllability and alignment with user intent compared to state-of-the-art alternatives, establishing the first framework for generating both object and camera motions directly from natural language specifications.
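To make the "DSL program deterministically mapped to 3D trajectories" idea concrete, here is a minimal sketch of what such a pipeline could look like. The LAMP DSL itself is not reproduced here, so the command names (`orbit`, `dolly_in`), their parameters, and the two-command program are illustrative assumptions, not the paper's actual interface.

```python
import math

# Hypothetical mini-DSL for cinematic camera motion (illustrative only;
# command names and semantics are assumptions, not LAMP's actual DSL).

def orbit(center, radius, height, degrees, n_frames):
    """Deterministically sample a camera orbit around `center` as (x, y, z) positions."""
    positions = []
    for i in range(n_frames):
        theta = math.radians(degrees) * i / max(n_frames - 1, 1)
        positions.append((center[0] + radius * math.cos(theta),
                          center[1] + radius * math.sin(theta),
                          center[2] + height))
    return positions

def dolly_in(start, target, distance, n_frames):
    """Move the camera from `start` toward `target` by `distance` units."""
    dx, dy, dz = (t - s for s, t in zip(start, target))
    norm = math.sqrt(dx * dx + dy * dy + dz * dz)
    ux, uy, uz = dx / norm, dy / norm, dz / norm
    return [(start[0] + ux * distance * i / max(n_frames - 1, 1),
             start[1] + uy * distance * i / max(n_frames - 1, 1),
             start[2] + uz * distance * i / max(n_frames - 1, 1))
            for i in range(n_frames)]

# "Orbit the subject 90 degrees, then dolly in" expressed as a two-command program:
trajectory = orbit(center=(0, 0, 0), radius=5, height=2, degrees=90, n_frames=24)
trajectory += dolly_in(start=trajectory[-1], target=(0, 0, 0), distance=3, n_frames=24)
print(len(trajectory))  # 48 camera positions
```

Because each command is a pure function of its arguments, the same program always yields the same trajectory, which is what gives the approach its interpretability and post-hoc editability: a user can tweak `degrees` or `distance` and re-run, rather than re-prompting a black-box generator.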