SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation

📅 2026-05-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches to multi-prompt long video generation struggle to balance semantic flexibility with visual coherence: rebuilding the cache at prompt boundaries adds redundant computation, and fixed memory budgets limit semantic adaptation. This work proposes SWIFT, a training-free framework for efficient streaming generation that combines a lightweight Semantic Injection Cache, head-wise semantic injection, an Adaptive Dynamic Window over temporal memory, and segment-level semantic anchors. SWIFT is presented as the first training-free, semantics-aware cache update mechanism. It reaches 22.6 FPS on a single H100 GPU, which the authors report as substantially more efficient than current state-of-the-art methods while preserving generation quality.
📝 Abstract
Streaming long-video generation faces a central challenge in continuous semantic switching, requiring adaptive memory to preserve coherent visual evolution. Current approaches rely on cache rebuilding at prompt boundaries or fixed memory budgets, but they introduce redundant computation and limit flexible semantic adaptation. This limitation arises from a mismatch between cached video history and prompt updates, as memory preserves visual continuity while prompt switches demand rapid semantic adaptation. Motivated by this observation, we present SWIFT, Semantic Windowing and Injection for Flexible Transitions, a training-free framework for multi-prompt long-video generation that enables efficient semantic switching while preserving temporal coherence in causal video diffusion models. SWIFT introduces a lightweight Semantic Injection Cache that augments cached video memory rather than reconstructing it from scratch at every prompt boundary. To avoid uniformly perturbing all attention channels, we further perform head-wise semantic injection, so that each attention head receives a prompt update proportional to its alignment with the current video state. In addition, we introduce an Adaptive Dynamic Window that allocates temporal memory according to prompt phase, using larger local context near switching boundaries and smaller windows during stable segments to reduce average inference cost. To preserve long-range semantic consistency under compressed local attention, we further maintain segment-level semantic anchors that summarize prompt-conditioned video history and reintroduce it as compact memory tokens. Compared with current state-of-the-art methods, SWIFT preserves generation quality while achieving 22.6 FPS on a single H100 GPU, establishing a substantially more efficient solution for multi-prompt long-video generation. Our code is available at https://github.com/ShanwenTan/SWIFT.
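As a rough illustration of the Adaptive Dynamic Window described in the abstract, the sketch below picks a larger temporal window near prompt-switch boundaries and a smaller one during stable segments. The function names, the window sizes (`large`, `small`), the `radius` that defines "near", and the use of a symmetric distance around each boundary are all assumptions made for illustration; the listing here does not give the paper's actual schedule.

```python
def window_size(frame_idx, switch_frames, large=24, small=8, radius=4):
    """Return the temporal window (in frames) to attend over at frame_idx.

    Hypothetical rule: frames within `radius` of any prompt switch get the
    large window; all other frames get the small one.
    """
    near_switch = any(abs(frame_idx - s) <= radius for s in switch_frames)
    return large if near_switch else small


def average_window(num_frames, switch_frames, **kwargs):
    """Average window size over a clip, a crude proxy for attention cost."""
    sizes = [window_size(t, switch_frames, **kwargs) for t in range(num_frames)]
    return sum(sizes) / len(sizes)


if __name__ == "__main__":
    switches = [50]  # one prompt switch at frame 50 (illustrative)
    print(window_size(50, switches), window_size(30, switches))
    print(average_window(100, switches))
```

Under these assumed values, only the 9 frames around the switch use the large window, so the average stays close to the small window size. This is the sense in which a phase-dependent window can lower average inference cost compared with always using the large one.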
Problem

Research questions and friction points this paper is trying to address.

long-video generation
semantic switching
temporal coherence
prompt adaptation
memory efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic Injection Cache
Head-wise Semantic Injection
Adaptive Dynamic Window
Segment-level Semantic Anchors
Prompt-Adaptive Memory
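
The head-wise semantic injection listed above can be pictured with a minimal sketch, under loudly stated assumptions: each attention head is represented by a single summary vector of its cached video state, alignment is measured as cosine similarity clipped at zero, and the prompt update is added to the cache scaled by that alignment. The abstract only states that each head receives an update proportional to its alignment with the current video state, so the names `headwise_inject` and `strength` and the exact formula are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def headwise_inject(cache, prompt, strength=0.5):
    """Augment per-head cached state with a prompt update.

    cache, prompt: lists with one vector per attention head.
    Assumed rule: weight = max(0, cosine(head_state, head_prompt)), so heads
    that do not align with the new prompt are left unchanged.
    Returns (updated_cache, per_head_weights).
    """
    updated, weights = [], []
    for head_state, head_prompt in zip(cache, prompt):
        w = max(0.0, cosine(head_state, head_prompt))
        weights.append(w)
        updated.append(
            [s + strength * w * p for s, p in zip(head_state, head_prompt)]
        )
    return updated, weights


if __name__ == "__main__":
    cache = [[1.0, 0.0], [0.0, 1.0]]   # two heads, 2-dim toy states
    prompt = [[1.0, 0.0], [1.0, 0.0]]  # same prompt direction for both heads
    print(headwise_inject(cache, prompt))
```

In this toy case the first head is fully aligned with the prompt and receives the full update, while the second head is orthogonal and is not perturbed. The cache is augmented in place of being rebuilt, which mirrors the stated intent of avoiding reconstruction at every prompt boundary.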