Programmatic Video Prediction Using Large Language Models

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Conventional end-to-end black-box approaches to video future-frame prediction suffer from poor interpretability and limited controllability. Method: This work proposes ProgGen, the first programmatic video world-modeling framework that leverages large language models (LLMs) and vision-language models (VLMs) for video forecasting. It formalizes video dynamics as a neuro-symbolic, frame-wise state sequence and employs LLM/VLM collaboration to generate three executable programs: state estimation, state-transition reasoning, and differentiable RGB rendering. Contribution/Results: The framework enables counterfactual reasoning and human-interpretable generation. Evaluated on two challenging physics-simulation benchmarks, PhyWorld and Cart Pole, it significantly outperforms existing state-of-the-art methods, jointly improving prediction fidelity, interpretability, and controllability.

📝 Abstract
The task of estimating the world model describing the dynamics of a real-world process assumes immense importance for anticipating and preparing for future outcomes. For applications such as video surveillance, robotics, and autonomous driving, this objective entails synthesizing plausible visual futures, given a few frames of a video to set the visual context. Towards this end, we propose ProgGen, which undertakes the task of video frame prediction by representing the dynamics of the video using a set of neuro-symbolic, human-interpretable states (one per frame), leveraging the inductive biases of Large (Vision) Language Models (LLM/VLM). In particular, ProgGen utilizes LLM/VLM to synthesize programs: (i) to estimate the states of the video, given the visual context (i.e., the frames); (ii) to predict the states corresponding to future time steps by estimating the transition dynamics; (iii) to render the predicted states as visual RGB frames. Empirical evaluations reveal that our proposed method outperforms competing techniques at the task of video frame prediction in two challenging environments: (i) PhyWorld and (ii) Cart Pole. Additionally, ProgGen permits counterfactual reasoning and interpretable video generation, attesting to its effectiveness and generalizability for video generation tasks.
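
To make the three-program decomposition concrete, here is a minimal, hypothetical Python sketch of what such synthesized programs might look like for the Cart Pole environment. The state layout, pixel heuristics, dynamics constants, and function names are illustrative assumptions, not the paper's actual LLM-generated code (which, notably, includes a differentiable renderer rather than the plain NumPy stub shown here).

```python
# Hypothetical sketch of ProgGen's three-program decomposition for Cart Pole.
# All state variables, dynamics, and rendering details below are illustrative
# assumptions; in ProgGen these programs are synthesized by an LLM/VLM.
from dataclasses import dataclass
import numpy as np

@dataclass
class State:          # neuro-symbolic per-frame state
    x: float          # cart position (normalized, roughly [-0.5, 0.5])
    x_dot: float      # cart velocity
    theta: float      # pole angle (radians)
    theta_dot: float  # pole angular velocity

def estimate_state(frame: np.ndarray) -> State:
    """(i) State estimation from a frame; stubbed here with a simple
    pixel heuristic (an assumption, not the paper's program)."""
    cart_row = frame[-20:].mean(axis=(0, 2))          # brightness per column
    x = float(cart_row.argmax()) / frame.shape[1] - 0.5
    return State(x=x, x_dot=0.0, theta=0.0, theta_dot=0.0)

def transition(s: State, dt: float = 0.02, g: float = 9.8) -> State:
    """(ii) Transition dynamics: a simplified pendulum term used as a
    plausible stand-in for the synthesized physics program."""
    theta_acc = g * np.sin(s.theta)
    return State(
        x=s.x + dt * s.x_dot,
        x_dot=s.x_dot,
        theta=s.theta + dt * s.theta_dot,
        theta_dot=s.theta_dot + dt * theta_acc,
    )

def render(s: State, h: int = 64, w: int = 64) -> np.ndarray:
    """(iii) Rendering: rasterize the state back to an RGB frame.
    ProgGen's renderer is differentiable; this NumPy stub is not."""
    frame = np.zeros((h, w, 3), dtype=np.float32)
    cx = int((s.x + 0.5) * (w - 1))
    frame[h - 6 : h - 2, max(cx - 4, 0) : min(cx + 4, w), :] = 1.0  # cart
    return frame
```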
Problem

Research questions and friction points this paper is trying to address.

Predicting future video frames using interpretable neuro-symbolic states
Leveraging LLMs/VLMs to estimate video states and their transition dynamics
Generating plausible visual futures for robotics and autonomous systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses neuro-symbolic states for video dynamics
Leverages LLM/VLM for program synthesis
Renders predicted states as RGB frames
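
Because the forecast is driven by explicit, editable states, counterfactual generation reduces to modifying a state and re-running the rollout. The snippet below reuses the hypothetical estimate_state / transition / render functions from the earlier sketch and is likewise illustrative, not the paper's code.

```python
import numpy as np

# Assumes the hypothetical estimate_state/transition/render defined above.
context = np.zeros((64, 64, 3), dtype=np.float32)  # stand-in context frame
s = estimate_state(context)

# Counterfactual edit: tilt the pole before forecasting the future.
s.theta = 0.3

frames = []
for _ in range(30):            # roll the dynamics forward 30 steps
    s = transition(s)
    frames.append(render(s))   # predicted future RGB frames
```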
Hao Tang
Department of Computer Science, Cornell University
Kevin Ellis
Cornell
Suhas Lohit
Principal Research Scientist, Mitsubishi Electric Research Laboratories
Computer Vision · Deep Learning · Computational Imaging · Visual Reasoning
Michael J. Jones
Mitsubishi Electric Research Laboratories (MERL)
Moitreya Chatterjee
Mitsubishi Electric Research Laboratories (MERL)