AI Summary
Video foundation models (VFMs) exhibit insufficient spatiotemporal alignment in compositional scenarios involving motion, quantity, and spatial relations. To address this, we propose TTOM, a training-free, test-time optimization framework that optimizes learnable parameters under layout-guided attention objectives to achieve fine-grained text-video spatiotemporal semantic alignment during inference only. Our key contributions are: (1) a parameterized memory module supporting dynamic read/write operations, which decouples compositional knowledge while preserving historical context; and (2) a general-purpose layout-aware attention supervision mechanism that operates without modifying the original model weights. TTOM achieves significant improvements in compositional generation accuracy on T2V-CompBench and VBench, validating its effectiveness, generalizability, computational efficiency, and scalability across diverse VFM architectures.
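To make the layout-aware attention supervision concrete, here is a minimal sketch of one plausible objective: penalize the fraction of an object token's cross-attention mass that falls outside its target layout box. The function name, the specific formulation, and the toy maps below are illustrative assumptions, not TTOM's actual loss.

```python
import numpy as np

def layout_attention_loss(attn, mask, eps=1e-8):
    # attn: (H, W) non-negative cross-attention map for one object token.
    # mask: (H, W) binary layout mask marking where the object should appear.
    # Hypothetical objective: share of attention mass falling outside the mask,
    # so the loss is 0 when all attention lies inside the layout box.
    inside = float((attn * mask).sum())
    total = float(attn.sum()) + eps
    return 1.0 - inside / total

# Toy check: a normalized map concentrated inside the box yields near-zero loss,
# while a map entirely outside the box yields a loss near 1.
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0
good = mask / mask.sum()                   # all mass inside the layout box
bad = (1.0 - mask) / (1.0 - mask).sum()    # all mass outside the box
print(layout_attention_loss(good, mask))   # near 0.0
print(layout_attention_loss(bad, mask))    # near 1.0
```

At test time, such a loss would be backpropagated into the newly introduced parameters (not the frozen VFM weights), which is what keeps the framework training-free with respect to the base model.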
Abstract
Video Foundation Models (VFMs) exhibit remarkable visual generation performance, but struggle in compositional scenarios (e.g., motion, numeracy, and spatial relations). In this work, we introduce Test-Time Optimization and Memorization (TTOM), a training-free framework that aligns VFM outputs with spatiotemporal layouts during inference for better text-video alignment. Rather than intervening directly on latents or attention per sample, as in existing work, we integrate and optimize new parameters guided by a general layout-attention objective. Furthermore, we formulate video generation as a streaming setting and maintain historical optimization contexts with a parametric memory mechanism that supports flexible operations such as insert, read, update, and delete. Notably, we find that TTOM disentangles compositional world knowledge, showing powerful transferability and generalization. Experimental results on the T2V-CompBench and VBench benchmarks establish TTOM as an effective, practical, scalable, and efficient framework for achieving cross-modal alignment in compositional video generation on the fly.
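The parametric memory described above can be pictured as a key-value store over test-time-optimized parameters. The sketch below assumes a simple interface derived from the four operations named in the abstract (insert, read, update, delete); the class name, key scheme, and stored values are hypothetical, and TTOM's actual module may differ.

```python
class ParametricMemory:
    """Minimal key-value sketch of a parametric memory for streaming generation.
    Keys might identify a prompt or compositional pattern; values hold the
    parameters optimized at test time for that context (interface assumed)."""

    def __init__(self):
        self._store = {}

    def insert(self, key, params):
        # Add a new entry; refuse to silently clobber an existing one.
        if key in self._store:
            raise KeyError(f"{key!r} already stored; use update()")
        self._store[key] = params

    def read(self, key, default=None):
        # Retrieve previously optimized parameters, or a fallback.
        return self._store.get(key, default)

    def update(self, key, params):
        # Overwrite (or create) an entry after further optimization steps.
        self._store[key] = params

    def delete(self, key):
        # Evict an entry; a no-op if the key is absent.
        self._store.pop(key, None)

# Usage: cache optimized parameters per compositional pattern across a stream.
mem = ParametricMemory()
mem.insert("two cats left of a dog", [0.1, -0.3])
mem.update("two cats left of a dog", [0.05, -0.2])
print(mem.read("two cats left of a dog"))  # [0.05, -0.2]
mem.delete("two cats left of a dog")
print(mem.read("two cats left of a dog"))  # None
```

A store like this is what lets historical optimization contexts carry over between chunks of a streamed generation instead of restarting optimization from scratch for every sample.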