AI Summary
Video foundation models (VFMs) exhibit insufficient spatiotemporal alignment in compositional scenarios involving motion, quantity, and spatial relations. To address this, we propose TTOM, a training-free, test-time optimization framework that optimizes learnable parameters under layout-guided attention objectives to achieve fine-grained text-video spatiotemporal semantic alignment during inference only. Our key contributions are: (1) a parameterized memory module supporting dynamic read/write operations, which decouples compositional knowledge while preserving historical context; and (2) a general-purpose layout-aware attention supervision mechanism that operates without modifying the original model weights. TTOM achieves significant improvements in compositional generation accuracy on T2V-CompBench and VBench, validating its effectiveness, generalizability, computational efficiency, and scalability across diverse VFM architectures.
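To make the layout-aware attention supervision concrete, here is a minimal sketch of one plausible objective: penalize the fraction of an object token's cross-attention mass that falls outside its target layout box. The function name, the specific formulation, and the toy maps below are illustrative assumptions, not TTOM's actual loss.

```python
import numpy as np

def layout_attention_loss(attn, mask, eps=1e-8):
    # attn: (H, W) non-negative cross-attention map for one object token.
    # mask: (H, W) binary layout mask marking where the object should appear.
    # Hypothetical objective: share of attention mass falling outside the mask,
    # so the loss is 0 when all attention lies inside the layout box.
    inside = float((attn * mask).sum())
    total = float(attn.sum()) + eps
    return 1.0 - inside / total

# Toy check: a normalized map concentrated inside the box yields near-zero loss,
# while a map entirely outside the box yields a loss near 1.
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0
good = mask / mask.sum()                   # all mass inside the layout box
bad = (1.0 - mask) / (1.0 - mask).sum()    # all mass outside the box
print(layout_attention_loss(good, mask))   # near 0.0
print(layout_attention_loss(bad, mask))    # near 1.0
```

At test time, such a loss would be backpropagated into the newly introduced parameters (not the frozen VFM weights), which is what keeps the framework training-free with respect to the base model.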
Abstract
Video Foundation Models (VFMs) exhibit remarkable visual generation performance, but struggle in compositional scenarios (e.g., motion, numeracy, and spatial relations). In this work, we introduce Test-Time Optimization and Memorization (TTOM), a training-free framework that aligns VFM outputs with spatiotemporal layouts during inference for better text-video alignment. Rather than intervening directly on latents or attention per sample, as in existing work, we integrate and optimize new parameters guided by a general layout-attention objective. Furthermore, we formulate video generation as a streaming setting and maintain historical optimization contexts with a parametric memory mechanism that supports flexible operations such as insert, read, update, and delete. Notably, we find that TTOM disentangles compositional world knowledge, showing powerful transferability and generalization. Experimental results on the T2V-CompBench and VBench benchmarks establish TTOM as an effective, practical, scalable, and efficient framework for achieving cross-modal alignment in compositional video generation on the fly.
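The parametric memory described above can be pictured as a key-value store over test-time-optimized parameters. The sketch below assumes a simple interface derived from the four operations named in the abstract (insert, read, update, delete); the class name, key scheme, and stored values are hypothetical, and TTOM's actual module may differ.

```python
class ParametricMemory:
    """Minimal key-value sketch of a parametric memory for streaming generation.
    Keys might identify a prompt or compositional pattern; values hold the
    parameters optimized at test time for that context (interface assumed)."""

    def __init__(self):
        self._store = {}

    def insert(self, key, params):
        # Add a new entry; refuse to silently clobber an existing one.
        if key in self._store:
            raise KeyError(f"{key!r} already stored; use update()")
        self._store[key] = params

    def read(self, key, default=None):
        # Retrieve previously optimized parameters, or a fallback.
        return self._store.get(key, default)

    def update(self, key, params):
        # Overwrite (or create) an entry after further optimization steps.
        self._store[key] = params

    def delete(self, key):
        # Evict an entry; a no-op if the key is absent.
        self._store.pop(key, None)

# Usage: cache optimized parameters per compositional pattern across a stream.
mem = ParametricMemory()
mem.insert("two cats left of a dog", [0.1, -0.3])
mem.update("two cats left of a dog", [0.05, -0.2])
print(mem.read("two cats left of a dog"))  # [0.05, -0.2]
mem.delete("two cats left of a dog")
print(mem.read("two cats left of a dog"))  # None
```

A store like this is what lets historical optimization contexts carry over between chunks of a streamed generation instead of restarting optimization from scratch for every sample.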