Looking Backward: Streaming Video-to-Video Translation with Feature Banks

📅 2024-05-24

🏛️ arXiv.org

📈 Citations: 4

✨ Influential: 0

🤖 AI Summary

To achieve both unlimited-frame processing and temporal consistency in real-time streaming video-to-video (V2V) translation, this paper introduces StreamV2V, a diffusion-based V2V model designed explicitly for streaming scenarios. The method maintains a backward-looking feature bank that stores features from past frames and feeds them into the self-attention computation, so that attention for each incoming frame also covers banked keys and values. This requires no fine-tuning and integrates with existing image diffusion models in a plug-and-play manner. The contributions are fourfold: (1) a streaming V2V diffusion paradigm; (2) a feature-bank-based mechanism for temporal modeling; (3) 20 FPS on a single A100 GPU, which is 15×, 46×, 108×, and 158× faster than FlowVid, CoDeF, Rerender, and TokenFlow, respectively; and (4) improved temporal consistency, supported by quantitative metrics and user studies.

📝 Abstract

This paper introduces StreamV2V, a diffusion model that achieves real-time streaming video-to-video (V2V) translation with user prompts. Unlike prior V2V methods using batches to process limited frames, we opt to process frames in a streaming fashion, to support unlimited frames. At the heart of StreamV2V lies a backward-looking principle that relates the present to the past. This is realized by maintaining a feature bank, which archives information from past frames. For incoming frames, StreamV2V extends self-attention to include banked keys and values and directly fuses similar past features into the output. The feature bank is continually updated by merging stored and new features, making it compact but informative. StreamV2V stands out for its adaptability and efficiency, seamlessly integrating with image diffusion models without fine-tuning. It can run 20 FPS on one A100 GPU, being 15x, 46x, 108x, and 158x faster than FlowVid, CoDeF, Rerender, and TokenFlow, respectively. Quantitative metrics and user studies confirm StreamV2V's exceptional ability to maintain temporal consistency.
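
The attention extension and bank update described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`extended_attention`, `update_bank`), the `max_size` cap, and the merge rule (averaging the most similar pair of keys by cosine similarity) are assumptions chosen for illustration, and the direct feature-fusion step is left out.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def extended_attention(q, k, v, bank_k, bank_v):
    """Scaled dot-product attention from current-frame queries `q` to the
    current frame's keys/values plus the banked keys/values.
    Each argument is a list of token vectors (lists of floats)."""
    keys = k + bank_k      # concatenate along the token axis
    values = v + bank_v
    scale = 1.0 / math.sqrt(len(q[0]))
    dim_v = len(values[0])
    out = []
    for qi in q:
        weights = softmax([dot(qi, kj) * scale for kj in keys])
        out.append([sum(w * vj[d] for w, vj in zip(weights, values))
                    for d in range(dim_v)])
    return out


def update_bank(bank_k, bank_v, new_k, new_v, max_size):
    """Hypothetical merge rule (an assumption, not the paper's exact method):
    append the new tokens, then repeatedly average the most similar pair of
    keys (and their values) until the bank holds at most `max_size` tokens."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    keys = bank_k + new_k
    values = bank_v + new_v
    while len(keys) > max_size:
        best = None
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                s = cosine(keys[i], keys[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        keys[i] = [(a + b) / 2 for a, b in zip(keys[i], keys[j])]
        values[i] = [(a + b) / 2 for a, b in zip(values[i], values[j])]
        del keys[j]
        del values[j]
    return keys, values
```

In a streaming loop under these assumptions, each incoming frame would call `extended_attention` with the current bank and then pass its own keys and values to `update_bank`, so the bank stays bounded no matter how many frames arrive.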

Problem

Research questions and friction points this paper is trying to address.

Real-time video-to-video translation

Streaming frame processing

Temporal consistency maintenance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Streaming video translation with feature banks

Backward-looking principle for temporal consistency

Real-time processing at 20 FPS on a single A100 GPU

🔎 Similar Papers

VideoPrism: A Foundational Visual Encoder for Video Understanding