🤖 AI Summary
Existing video generation models struggle with content that requires multi-step semantic reasoning or complex action sequences. To address this, we propose TV2TV, a unified framework that explicitly integrates language-based reasoning into video generation. Our approach introduces an alternating "textual reasoning → visual execution" generation paradigm, enabling fine-grained textual intervention and dynamic decisions about the generation trajectory. Built on a Mixture-of-Transformers (MoT) architecture, TV2TV jointly learns language modeling and video flow matching, and uses vision-language models to annotate video trajectories with natural-language action descriptions. Evaluated on game and sports video generation tasks, TV2TV achieves significant improvements in visual fidelity, prompt alignment, and temporal action modeling, demonstrating enhanced controllability and reasoning capability in realistic scenarios.
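To make the alternating "textual reasoning → visual execution" paradigm concrete, the sketch below shows one way inference could interleave language-tower next-token prediction with video-tower frame generation. All method names and switching tokens here (`predict_next_text_token`, `sample_frame_flow_matching`, `<BOV>`) are hypothetical placeholders for illustration, not the paper's released API.

```python
import torch


@torch.no_grad()
def generate_interleaved(model, prompt_tokens, max_steps=64):
    """Hedged sketch of interleaved text/video generation.

    The model alternates between "thinking in words" (autoregressive
    next-token prediction) and "acting in pixels" (flow-matching frame
    generation), with special tokens deciding when to switch modes.
    """
    sequence = list(prompt_tokens)
    mode = "text"
    for _ in range(max_steps):
        if mode == "text":
            # Language tower: predict the next text token given the full history.
            tok = model.predict_next_text_token(sequence)
            sequence.append(tok)
            if tok == model.BOV_TOKEN:  # model decides to start generating frames
                mode = "video"
        else:
            # Video tower: produce the next frame's tokens via flow matching.
            frame_tokens = model.sample_frame_flow_matching(sequence)
            sequence.extend(frame_tokens)
            if model.should_return_to_text(sequence):  # e.g., an <EOV> decision
                mode = "text"
    return sequence
```

Because the text segments sit in the same sequence as the video tokens, a user could in principle edit or append text at any switch point to steer the remaining frames, which is the kind of text-based intervention the summary describes.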
📝 Abstract
Video generation models are rapidly advancing, but they can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent language model (LM) reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework that decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to "think in words" about subsequent content before "acting in pixels" to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural-language action descriptions generated by vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.
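As a rough illustration of the Mixture-of-Transformers idea in the abstract, the minimal sketch below runs self-attention jointly over the interleaved text and video tokens while routing each position through a modality-specific feed-forward path. The layer sizes, which components are modality-specific, and the interface are assumptions made for this example, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class MoTBlock(nn.Module):
    """Illustrative Mixture-of-Transformers block (assumed design):
    shared global self-attention over the interleaved sequence,
    with separate text/video feed-forward "towers" per position."""

    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Modality-specific parameters: one feed-forward network per tower.
        self.ffn = nn.ModuleDict({
            "text": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
            "video": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
        })
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor, is_text: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, dim); is_text: (batch, seq) boolean modality mask.
        h = self.norm1(tokens)
        attn_out, _ = self.attn(h, h, h, need_weights=False)  # attention spans both modalities
        tokens = tokens + attn_out
        h = self.norm2(tokens)
        # Route each position through its modality's FFN (both paths are
        # computed here for clarity; a real implementation would scatter).
        out = torch.where(is_text.unsqueeze(-1), self.ffn["text"](h), self.ffn["video"](h))
        return tokens + out


# Example: a batch where the first 4 positions are text tokens, the rest video tokens.
block = MoTBlock()
x = torch.randn(2, 10, 1024)
mask = torch.zeros(2, 10, dtype=torch.bool)
mask[:, :4] = True
y = block(x, mask)  # (2, 10, 1024)
```

In this setup, keeping attention shared while splitting the heavier per-token computation by modality is what lets the language and video towers specialize without losing the ability to condition on each other's tokens.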