TTOM: Test-Time Optimization and Memorization for Compositional Video Generation

📅 2025-10-09
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Video foundation models (VFMs) show insufficient spatiotemporal alignment in compositional scenarios involving motion, quantity, and spatial relations. To address this, we propose TTOM, a training-free test-time optimization framework that optimizes newly introduced learnable parameters under a layout-guided attention objective to achieve fine-grained text-video spatiotemporal semantic alignment at inference time. Our key contributions are: (1) a parameterized memory module supporting dynamic read/write operations, which decouples compositional knowledge while preserving historical context; and (2) a general-purpose layout-aware attention supervision mechanism that operates without modifying the original model weights. TTOM demonstrates significant improvements in compositional generation accuracy on T2V-CompBench and VBench, validating its effectiveness, generalizability, computational efficiency, and scalability across diverse VFM architectures.
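The layout-aware attention supervision can be pictured as a loss that concentrates each entity's cross-attention mass inside its assigned spatiotemporal box. The paper's exact formulation is not reproduced here, so the PyTorch snippet below is a hedged sketch: the function name `layout_attention_loss`, the tensor shapes, and the mask format are assumptions for illustration.

```python
import torch

def layout_attention_loss(attn, masks, entity_tokens):
    """Sketch of a layout-guided attention objective (assumed formulation).

    attn:          (F, HW, T) cross-attention maps for F frames, HW spatial
                   positions, and T text tokens.
    masks:         (E, F, HW) binary spatiotemporal layout masks, one per entity.
    entity_tokens: list of E lists with the text-token indices of each entity.
    """
    loss = attn.new_zeros(())
    for e, tok_ids in enumerate(entity_tokens):
        # Attention the entity's tokens place on each spatial position: (F, HW)
        a = attn[:, :, tok_ids].mean(dim=-1)
        inside = (a * masks[e]).sum(dim=-1)          # mass inside the box, per frame
        total = a.sum(dim=-1) + 1e-6                 # total mass, per frame
        loss = loss + (1.0 - inside / total).mean()  # penalize mass leaking outside
    return loss / max(len(entity_tokens), 1)
```

Gradients of such a loss flow only into the newly introduced parameters; the frozen VFM weights never receive updates.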

πŸ“ Abstract
Video Foundation Models (VFMs) exhibit remarkable visual generation performance, but struggle in compositional scenarios (e.g., motion, numeracy, and spatial relation). In this work, we introduce Test-Time Optimization and Memorization (TTOM), a training-free framework that aligns VFM outputs with spatiotemporal layouts during inference for better text-video alignment. Rather than intervening directly on latents or attention per sample, as in existing work, we integrate and optimize new parameters guided by a general layout-attention objective. Furthermore, we formulate video generation within a streaming setting, and maintain historical optimization contexts with a parametric memory mechanism that supports flexible operations such as insert, read, update, and delete. Notably, we find that TTOM disentangles compositional world knowledge, showing powerful transferability and generalization. Experimental results on the T2V-CompBench and VBench benchmarks establish TTOM as an effective, practical, scalable, and efficient framework for achieving cross-modal alignment in compositional video generation on the fly.
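To make the streaming formulation and the memory operations concrete, here is a minimal sketch of a parametric memory keyed by prompt entities. The class name, keying scheme, and momentum-based update rule are assumptions for illustration, not the authors' released code.

```python
class ParametricMemory:
    """Minimal sketch: keeps per-entity optimized parameter tensors
    (e.g., torch.Tensor) across a video stream."""

    def __init__(self):
        self.store = {}                      # key -> parameter tensor

    def insert(self, key, params):
        self.store[key] = params.detach().clone()

    def read(self, key):
        return self.store.get(key)           # None if the entity is unseen

    def update(self, key, params, momentum=0.9):
        old = self.store.get(key)
        if old is None:
            self.insert(key, params)
        else:  # blend newly optimized parameters into the stored context
            self.store[key] = momentum * old + (1.0 - momentum) * params.detach()

    def delete(self, key):
        self.store.pop(key, None)
```

In a streaming setting, parameters read back for an entity seen in an earlier chunk can initialize the optimization for the next chunk, which is how historical optimization context is carried forward.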
Problem

Research questions and friction points this paper is trying to address.

Improves text-video alignment in compositional video generation scenarios
Optimizes video outputs using spatiotemporal layouts during inference
Enhances motion, numeracy, and spatial-relation capabilities in generated videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimizes newly introduced parameters under a layout-attention objective during inference (sketched after this list)
Implements a parametric memory to manage historical optimization context
Enables streaming video generation with flexible memory operations (insert, read, update, delete)
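A single test-time optimization round might then look like the sketch below: the VFM stays frozen, a small set of newly introduced parameters is updated against the layout-attention objective (reusing the `layout_attention_loss` sketch above), and the result can be written into the parametric memory. The `step_fn` hook, optimizer choice, and iteration counts are illustrative assumptions rather than the authors' implementation.

```python
import torch

def test_time_optimize(new_params, latents, step_fn, masks, entity_tokens,
                       n_iters=5, lr=1e-2):
    """One TTOM-style optimization round at a denoising step (sketch).

    new_params:    list of freshly introduced learnable tensors
                   (requires_grad=True); the VFM's own weights stay frozen.
    latents:       current video latents for the chunk being generated.
    step_fn:       hypothetical hook: (latents, new_params) -> (denoised, attn),
                   where attn are the cross-attention maps used by the loss.
    masks, entity_tokens: spatiotemporal layout supervision (see loss sketch above).
    """
    opt = torch.optim.AdamW(new_params, lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        _, attn = step_fn(latents, new_params)
        loss = layout_attention_loss(attn, masks, entity_tokens)
        loss.backward()
        opt.step()
    return new_params
```

The optimized `new_params` can then be inserted into, or merged with, the corresponding `ParametricMemory` entry before the next chunk is generated.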
Authors
Leigang Qu
National University of Singapore
Ziyang Wang
National University of Singapore
Na Zheng
National University of Singapore
Wenjie Wang
University of Science and Technology of China
Liqiang Nie
Harbin Institute of Technology (Shenzhen)
Tat-Seng Chua
National University of Singapore