PackCache: A Training-Free Acceleration Method for Unified Autoregressive Video Generation via Compact KV-Cache

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the linear growth of the KV cache in unified autoregressive video generation, which severely limits inference efficiency and maximum generation length. It reveals, for the first time, the distinct spatiotemporal attention characteristics of text/image condition tokens and video-frame tokens within the KV cache, and leverages these insights to propose a training-free dynamic cache compression method. By preserving semantic anchors, modeling inter-frame attention decay, and maintaining spatial positional consistency, the approach efficiently prunes redundant cache entries. The method achieves end-to-end speedups of 1.7–2.2× for 48-frame video generation, with the final four frames accelerated by 2.6× on A40 and 3.7× on H200 GPUs, significantly improving long-video generation efficiency.

📝 Abstract
A unified autoregressive model is a Transformer-based framework that addresses diverse multimodal tasks (e.g., text, image, video) as a single sequence modeling problem under a shared token space. Such models rely on the KV-cache mechanism to reduce attention computation from O(T^2) to O(T); however, KV-cache size grows linearly with the number of generated tokens, and it rapidly becomes the dominant bottleneck limiting inference efficiency and generation length. Unified autoregressive video generation inherits this limitation. Our analysis reveals that KV-cache tokens exhibit distinct spatiotemporal properties: (i) text and conditioning-image tokens act as persistent semantic anchors that consistently receive high attention, and (ii) attention to previous frames naturally decays with temporal distance. Leveraging these observations, we introduce PackCache, a training-free KV-cache management method that dynamically compacts the KV cache through three coordinated mechanisms: condition anchoring that preserves semantic references, cross-frame decay modeling that allocates cache budget according to temporal distance, and spatially preserving position embedding that maintains coherent 3D structure under cache removal. In terms of efficiency, PackCache accelerates end-to-end generation by 1.7-2.2x on 48-frame long sequences, showcasing its strong potential for enabling longer-sequence video generation. Notably, for the final four frames, the portion most impacted by the progressively expanding KV cache and thus the most expensive segment of the clip, PackCache delivers 2.6x and 3.7x accelerations on A40 and H200 GPUs, respectively, for 48-frame videos.
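The three mechanisms described in the abstract can be sketched as a single cache-compaction pass. The sketch below is illustrative only: the function name, the exponential per-frame budget schedule, and the use of accumulated attention scores as the pruning criterion are assumptions, not the paper's exact formulation. Condition tokens are never pruned (condition anchoring), each past frame receives a budget share that shrinks with temporal distance (cross-frame decay), and kept tokens retain their original in-frame order so position embeddings remain spatially coherent.

```python
# Hedged sketch of a PackCache-style KV-cache compaction step.
# Assumptions (not from the paper): geometric budget decay across frames,
# per-token accumulated attention scores as the pruning signal.
import numpy as np

def compact_kv_cache(cond_len, frame_lens, attn_scores, total_budget, decay=0.5):
    """Return the cached token positions to keep.

    cond_len     -- number of text / condition-image tokens (always kept).
    frame_lens   -- tokens per previously generated frame, oldest first.
    attn_scores  -- accumulated attention score per cached token (1-D array).
    total_budget -- max tokens retained from past frames (conditions excluded).
    decay        -- per-frame geometric decay of the cache budget (assumed).
    """
    keep = list(range(cond_len))          # condition anchoring: never prune
    n_frames = len(frame_lens)
    # Cross-frame decay: newer frames receive a larger share of the budget.
    weights = np.array([decay ** (n_frames - 1 - i) for i in range(n_frames)])
    weights /= weights.sum()
    budgets = np.floor(weights * total_budget).astype(int)

    start = cond_len
    for flen, budget in zip(frame_lens, budgets):
        idx = np.arange(start, start + flen)
        k = min(int(budget), flen)
        if k > 0:
            # Keep the k highest-attention tokens of this frame, restoring
            # their original (spatial) order so positions stay coherent.
            top = idx[np.argsort(attn_scores[idx])[-k:]]
            keep.extend(sorted(top.tolist()))
        start += flen
    return keep
```

For example, with 2 condition tokens, two past frames of 3 tokens each, and a budget of 3, the older frame keeps its single highest-attention token while the newer frame keeps two, and the condition tokens survive regardless of their scores.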
Problem

Research questions and friction points this paper is trying to address.

KV-cache
autoregressive video generation
inference efficiency
unified multimodal modeling
sequence length limitation
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV-cache compression
training-free acceleration
autoregressive video generation
spatiotemporal attention modeling
unified multimodal transformers