Video-T1: Test-Time Scaling for Video Generation

📅 2025-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the quality limitations of text-to-video generation under complex prompts. We propose a test-time scaling paradigm that models video generation as a trajectory search from Gaussian noise space to the target video distribution. Our method introduces Tree-of-Frames (ToF), an autoregressive branching-and-pruning mechanism that improves generation quality at inference time without additional training. We also pair a test-time verifier with a linear search over noise candidates to select better diffusion trajectories. To our knowledge, this is the first systematic application of test-time scaling to video generation. Across multiple benchmarks, our approach consistently improves performance (reducing FVD and increasing CLIP-Score) using only additional inference compute, significantly enhancing both semantic consistency and visual fidelity.

📝 Abstract
With the scaling of training data, model size, and computational cost, video generation has achieved impressive results in digital creation, enabling users to express creativity across various domains. Recently, researchers in Large Language Models (LLMs) have extended scaling to test time, which can significantly improve LLM performance by using more inference-time computation. Instead of scaling up video foundation models through expensive training, we explore the power of Test-Time Scaling (TTS) in video generation, aiming to answer the question: if a video generation model is allowed to use a non-trivial amount of inference-time compute, how much can it improve generation quality given a challenging text prompt? In this work, we reinterpret the test-time scaling of video generation as a search problem: sampling better trajectories from Gaussian noise space to the target video distribution. Specifically, we build the search space with test-time verifiers to provide feedback and heuristic algorithms to guide the search process. Given a text prompt, we first explore an intuitive linear search strategy that increases the number of noise candidates at inference time. As full-step denoising of all frames simultaneously incurs heavy test-time computation costs, we further design a more efficient TTS method for video generation called Tree-of-Frames (ToF) that adaptively expands and prunes video branches in an autoregressive manner. Extensive experiments on text-conditioned video generation benchmarks demonstrate that increasing test-time compute consistently leads to significant improvements in video quality. Project page: https://liuff19.github.io/Video-T1
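The linear search strategy described in the abstract can be sketched as best-of-N selection over noise candidates. This is a minimal illustration, not the paper's implementation: `generate` and `verifier_score` are hypothetical stand-ins for a full-step diffusion sampler and a test-time verifier.

```python
import random

def generate(prompt, noise_seed):
    # Stand-in for full-step denoising from one Gaussian noise candidate;
    # here a "video" is just a list of 4 random frame scores.
    random.seed(noise_seed)
    return [random.random() for _ in range(4)]

def verifier_score(prompt, video):
    # Stand-in for a test-time verifier rating prompt alignment / quality.
    return sum(video) / len(video)

def linear_search(prompt, num_candidates):
    # Best-of-N: fully denoise each noise candidate, keep the
    # highest-scoring video according to the verifier.
    best_video, best_score = None, float("-inf")
    for seed in range(num_candidates):
        video = generate(prompt, seed)
        score = verifier_score(prompt, video)
        if score > best_score:
            best_video, best_score = video, score
    return best_video, best_score
```

More candidates can only raise (never lower) the best verifier score found, which is why quality improves monotonically with test-time compute; the cost is that every candidate is denoised for all frames.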
Problem

Research questions and friction points this paper is trying to address.

Exploring Test-Time Scaling to enhance video generation quality
Developing efficient search strategies for better video trajectories
Reducing heavy computation costs in video generation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-Time Scaling for video generation
Searching better trajectories with verifiers
Tree-of-Frames for efficient video branching
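The Tree-of-Frames idea above can be sketched as a beam-style search that generates frames autoregressively, expanding each surviving prefix into several branches and pruning back with a verifier. This is a toy sketch under stated assumptions: `next_frame` and `verifier_score` are hypothetical stand-ins, and the branch/keep parameters are illustrative only.

```python
import random

def next_frame(prefix, branch_seed):
    # Stand-in for autoregressively denoising one more frame
    # conditioned on the frames generated so far.
    random.seed(hash((tuple(prefix), branch_seed)))
    return prefix + [random.random()]

def verifier_score(video):
    # Stand-in for a verifier scoring a partial video trajectory.
    return sum(video) / len(video)

def tree_of_frames(num_frames, branch_factor=3, keep=2):
    # Expand each kept prefix into `branch_factor` branches per step,
    # then prune to the top-`keep` prefixes by verifier score.
    beams = [[]]
    for _ in range(num_frames):
        candidates = [next_frame(p, b) for p in beams for b in range(branch_factor)]
        candidates.sort(key=verifier_score, reverse=True)
        beams = candidates[:keep]
    return beams[0]
```

Compared with the linear best-of-N search, pruning weak prefixes early means compute concentrates on promising trajectories instead of fully denoising every candidate, which is the efficiency gain the paper attributes to ToF.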