Learning Plug-and-play Memory for Guiding Video Diffusion Models

πŸ“… 2025-11-24
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing DiT-based video generation models achieve high visual fidelity but often violate physical laws and commonsense dynamics because they do not explicitly model world knowledge. To address this, we propose DiT-Mem: a plug-and-play, learnable memory encoder that decouples appearance from physical-semantic cues via 3D CNNs combined with high-/low-pass filtering, and encodes reference videos into memory tokens injected into DiT’s self-attention layers to guide generation toward physically grounded dynamics. DiT-Mem supports frozen-backbone training, requiring only 150M trainable parameters and 10K samples for efficient optimization. Extensive experiments demonstrate significant improvements in physical plausibility and visual fidelity across multiple benchmarks. Moreover, DiT-Mem exhibits strong plug-and-play compatibility and cross-model generalization. Code and data are publicly available.
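
The summary's point that simple low- and high-pass filtering in the embedding space separates appearance from physical/semantic cues can be illustrated with a short sketch. The FFT-based split below, the `cutoff` value, and the (B, C, T, H, W) tensor layout are assumptions for demonstration, not the paper's released implementation.

```python
# Illustrative sketch (assumed, not DiT-Mem's actual code): splitting a video
# embedding into low- and high-frequency components with a 3D FFT.
import torch


def split_frequency_bands(latent: torch.Tensor, cutoff: float = 0.25):
    """Split a (B, C, T, H, W) video embedding into low- and high-frequency parts."""
    freq = torch.fft.fftn(latent, dim=(-3, -2, -1))
    freq = torch.fft.fftshift(freq, dim=(-3, -2, -1))

    t, h, w = latent.shape[-3:]
    # Centered radial mask: 1 inside the low-frequency ball of radius `cutoff`.
    grids = torch.meshgrid(
        torch.linspace(-1, 1, t),
        torch.linspace(-1, 1, h),
        torch.linspace(-1, 1, w),
        indexing="ij",
    )
    radius = torch.sqrt(sum(g ** 2 for g in grids))
    mask = (radius <= cutoff).float().to(freq.device)

    def to_spatial(x):
        x = torch.fft.ifftshift(x, dim=(-3, -2, -1))
        return torch.fft.ifftn(x, dim=(-3, -2, -1)).real

    low_pass = to_spatial(freq * mask)           # smooth appearance / layout
    high_pass = to_spatial(freq * (1.0 - mask))  # fine detail / motion cues
    return low_pass, high_pass
```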

πŸ“ Abstract
Diffusion Transformer (DiT)-based video generation models have recently achieved impressive visual quality and temporal coherence, but they still frequently violate basic physical laws and commonsense dynamics, revealing a lack of explicit world knowledge. In this work, we explore how to equip them with a plug-and-play memory that injects useful world knowledge. Motivated by in-context memory in Transformer-based LLMs, we conduct empirical studies showing that DiT can be steered via interventions on its hidden states, and that simple low-pass and high-pass filters in the embedding space naturally disentangle low-level appearance and high-level physical/semantic cues, enabling targeted guidance. Building on these observations, we propose a learnable memory encoder, DiT-Mem, composed of stacked 3D CNNs, low-/high-pass filters, and self-attention layers. The encoder maps reference videos into a compact set of memory tokens, which are concatenated as memory within the DiT self-attention layers. During training, we keep the diffusion backbone frozen and optimize only the memory encoder. This yields an efficient training process with few trainable parameters (150M) and 10K data samples, and enables plug-and-play usage at inference time. Extensive experiments on state-of-the-art models demonstrate the effectiveness of our method in improving physical-rule following and video fidelity. Our code and data are publicly released here: https://thrcle421.github.io/DiT-Mem-Web/.
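
For concreteness, here is a minimal sketch of a memory encoder in the spirit the abstract describes: stacked 3D convolutions followed by attention that compress a reference video into a fixed number of memory tokens. All layer sizes, the token count, and the use of learnable query tokens for pooling are illustrative assumptions rather than the released DiT-Mem architecture.

```python
# Minimal sketch of a reference-video -> memory-token encoder (assumptions, not
# the paper's architecture): 3D CNN stages compress the video latent, then
# learnable query tokens attend over the resulting features.
import torch
import torch.nn as nn


class MemoryEncoder(nn.Module):
    def __init__(self, in_channels=4, dim=512, num_tokens=64, num_layers=2):
        super().__init__()
        # Stacked 3D CNN stages that downsample the reference-video latent.
        self.conv = nn.Sequential(
            nn.Conv3d(in_channels, dim // 2, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv3d(dim // 2, dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
        )
        # Learnable queries pool the video features into a fixed number of tokens.
        self.queries = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.attn_layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
             for _ in range(num_layers)]
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, reference_latent: torch.Tensor) -> torch.Tensor:
        """reference_latent: (B, C, T, H, W) -> memory tokens (B, num_tokens, dim)."""
        feats = self.conv(reference_latent)          # (B, dim, T', H', W')
        feats = feats.flatten(2).transpose(1, 2)     # (B, T'*H'*W', dim)
        tokens = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        for attn in self.attn_layers:
            pooled, _ = attn(tokens, feats, feats)   # queries attend to video features
            tokens = self.norm(tokens + pooled)
        return tokens
```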
Problem

Research questions and friction points this paper is trying to address.

Video diffusion models violate physical laws and commonsense dynamics
Current models lack explicit world knowledge for realistic generation
Need plug-and-play memory to inject useful world knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Plug-and-play memory encoder for video diffusion models
Memory tokens injected into DiT self-attention layers
Frozen diffusion backbone; only the memory encoder (~150M parameters) is optimized (see the sketch below)
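
A hedged sketch of the injection step referenced above: memory tokens are concatenated with the DiT hidden states as extra keys and values inside a frozen self-attention layer, so gradients reach only the memory encoder. The wrapper class, shapes, and the use of `nn.MultiheadAttention` are illustrative assumptions, not the paper's released code.

```python
# Assumed illustration of plug-and-play memory injection into a frozen
# self-attention layer; not the official DiT-Mem implementation.
import torch
import torch.nn as nn


class MemoryAugmentedSelfAttention(nn.Module):
    def __init__(self, frozen_attn: nn.MultiheadAttention):
        super().__init__()
        self.attn = frozen_attn
        for p in self.attn.parameters():      # keep the diffusion backbone frozen
            p.requires_grad_(False)

    def forward(self, hidden: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        """hidden: (B, N, D) DiT tokens; memory: (B, M, D) tokens from the encoder."""
        # Memory tokens extend the key/value sequence, steering attention toward
        # the reference video without changing any backbone weights.
        kv = torch.cat([hidden, memory], dim=1)
        out, _ = self.attn(hidden, kv, kv)
        return out


# Usage: wrap an existing (frozen) attention layer and pass encoder outputs in.
attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
layer = MemoryAugmentedSelfAttention(attn)
video_tokens = torch.randn(2, 256, 512)
memory_tokens = torch.randn(2, 64, 512)       # e.g. from a MemoryEncoder
out = layer(video_tokens, memory_tokens)      # (2, 256, 512)
```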
Selena Song
University of California, San Diego
Ziming Xu
University of California, San Diego
Zijun Zhang
University of California, San Diego
Kun Zhou
University of California, San Diego
Jiaxian Guo
Google Research
Efficient Foundation Model, Reinforcement Learning, Causality
Lianhui Qin
UC San Diego, Computer Science and Engineering
Natural Language Processing, Machine Learning
Biwei Huang
UCSD
Causality, Machine Learning, Computational Science