OccDirector: Language-Guided Behavior and Interaction Generation in 4D Occupancy Space

📅 2026-04-24
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses a limitation of existing 4D occupancy generation methods: because they rely on rigid geometric constraints or simplistic textual attributes, they struggle to model complex, temporally coherent multi-agent interactions. To overcome this, we propose OccDirector, the first framework capable of generating physically plausible 4D dynamic occupancy sequences driven solely by natural language instructions, enabling end-to-end orchestration of multi-agent behaviors. Our key contributions are the construction of OccInteract-85k, a large-scale dataset with multi-granularity language annotations; the design of a vision-language model (VLM)-driven spatio-temporal MMDiT architecture augmented with a history-prefix anchoring strategy to ensure long-horizon consistency; and the establishment of a VLM-based evaluation benchmark. Experiments demonstrate that our approach significantly outperforms prior methods in both generation quality and instruction-following fidelity, enabling precise control over intricate multi-agent interactions in complex traffic scenarios.

📝 Abstract
Generative world models increasingly rely on 4D occupancy for realistic autonomous driving simulation. However, existing generation frameworks depend on rigid geometric conditions (e.g., explicit trajectories) or simplistic attribute-level text, failing to orchestrate complex, sequential multi-agent interactions. To address this semantic-spatiotemporal gap, we propose OccDirector, a pioneering framework that generates 4D occupancy dynamics conditioned solely on natural language. Operating as a "scenario director", OccDirector maps natural language scripts into physically plausible voxel dynamics without requiring geometric priors. Technically, it employs a VLM-driven Spatio-Temporal MMDiT equipped with a history-prefix anchoring strategy to ensure long-horizon interaction consistency. Furthermore, we introduce OccInteract-85k, a novel dataset uniquely annotated with multi-level language instructions ranging from static layouts to intricate multi-agent behaviors, alongside a novel VLM-based evaluation benchmark. Extensive experiments demonstrate that OccDirector achieves state-of-the-art generation quality and unprecedented instruction-following capabilities, successfully shifting the paradigm from appearance synthesis to language-driven behavior orchestration.
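
The abstract does not spell out how history-prefix anchoring works. As a rough illustration only, the sketch below shows one way chunked long-horizon generation could reuse the last few frames of each chunk as a fixed prefix for the next one. Everything here is an assumption rather than the paper's implementation: `rollout`, `generate_chunk`, `toy_generator`, `chunk_len`, and `prefix_len` are hypothetical names, and the toy generator merely stands in for the language-conditioned MMDiT.

```python
from typing import Callable, List

# Stand-in for one occupancy frame (a real frame would be an X x Y x Z
# grid of semantic voxel labels); an int keeps the sketch self-contained.
Frame = int


def rollout(
    generate_chunk: Callable[[List[Frame], str, int], List[Frame]],
    prompt: str,
    total_frames: int,
    chunk_len: int = 8,
    prefix_len: int = 2,
) -> List[Frame]:
    """Generate `total_frames` frames chunk by chunk (hypothetical scheme).

    Each call to `generate_chunk` receives the last `prefix_len` frames
    produced so far (the history prefix) plus the language prompt, and
    returns `chunk_len` new frames that continue from that prefix.
    """
    frames: List[Frame] = []
    while len(frames) < total_frames:
        # The first chunk has no history, so its prefix is empty.
        prefix = frames[-prefix_len:] if prefix_len > 0 else []
        frames.extend(generate_chunk(prefix, prompt, chunk_len))
    return frames[:total_frames]


def toy_generator(prefix: List[Frame], prompt: str, n: int) -> List[Frame]:
    """Toy model: continue counting from the last prefix frame."""
    start = prefix[-1] + 1 if prefix else 0
    return list(range(start, start + n))


sequence = rollout(toy_generator, "a car overtakes a truck", total_frames=20)
```

In this toy setting the prefix is what keeps consecutive chunks contiguous; a real model would presumably condition its denoising on the prefix frames rather than read a counter from them.
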
Problem

Research questions and friction points this paper is trying to address.

4D occupancy
multi-agent interaction
language-guided generation
autonomous driving simulation
spatio-temporal dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

4D occupancy generation
language-guided behavior
multi-agent interaction
spatio-temporal MMDiT
VLM-driven simulation