🤖 AI Summary
To address the high MLP computational overhead and the strong temporal redundancy between adjacent frames in edge-deployed autoregressive video generation, this paper proposes a cache-aware replay mechanism. It introduces a Temporal Attention Score (TAS) to dynamically identify when cached MLP outputs can be reused, and integrates an FPGA-based hardware accelerator with Dynamic Resource Scheduling (DRS) to jointly optimize inference efficiency. The framework further incorporates sparse attention to reduce compute load while preserving spatiotemporal coherence. Evaluated on edge platforms, the method achieves over 2.1× decoding speedup and significantly improved energy efficiency without compromising visual quality. Moreover, it effectively mitigates cumulative drift in long-duration, high-resolution video generation, outperforming state-of-the-art sparse attention approaches in both fidelity and stability.
📝 Abstract
Auto-regressive (AR) models, initially successful in language generation, have recently shown promise in visual generation tasks due to their superior sampling efficiency. Unlike image generation, video generation requires a substantially larger number of tokens to produce coherent temporal frames, resulting in significant overhead during the decoding phase. Our key observations are: (i) MLP modules in the decode phase dominate the inference latency, and (ii) there exists high temporal redundancy in the MLP outputs of adjacent frames. In this paper, we propose the **FastCar** framework to accelerate the decode phase of AR video generation by exploiting this temporal redundancy. The Temporal Attention Score (TAS) is proposed to determine whether to apply the replay strategy (*i.e.*, reusing cached MLP outputs from the previous frame to reduce redundant computation), with detailed theoretical analysis and justification. We also develop a hardware accelerator on FPGA with Dynamic Resource Scheduling (DRS) based on TAS to enable better resource utilization and faster inference. Experimental results demonstrate the effectiveness of our method, which outperforms traditional sparse attention approaches with more than 2.1x decoding speedup and higher energy efficiency on the edge. Furthermore, by combining FastCar with sparse attention, FastCar can boost the performance of sparse attention with alleviated drifting, demonstrating our unique advantages for high-resolution and long-duration video generation. Code: https://github.com/shawnricecake/fast-car
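The TAS-gated replay idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the way TAS is computed from attention weights, and the fixed threshold are all assumptions for illustration; the paper derives TAS and its threshold with a dedicated theoretical analysis.

```python
import numpy as np

def temporal_attention_score(attn_weights, prev_frame_slice):
    """Hypothetical TAS: average attention mass that the current frame's
    tokens place on the previous frame's token positions. A high score
    suggests strong temporal redundancy between adjacent frames."""
    return float(attn_weights[:, prev_frame_slice].sum(axis=-1).mean())

def mlp_with_replay(x, mlp, cached_out, tas, threshold=0.8):
    """Replay strategy sketch: when TAS exceeds a threshold, reuse the
    cached MLP output from the previous frame and skip the MLP entirely;
    otherwise recompute and refresh the cache. Returns (output, cache)."""
    if cached_out is not None and tas >= threshold:
        return cached_out, cached_out  # replay: no MLP computation
    out = mlp(x)                       # recompute for this frame
    return out, out                    # refresh the cache

# Usage sketch with a toy "MLP":
mlp = lambda x: x * 2
out1, cache = mlp_with_replay(3.0, mlp, None, tas=0.9)   # first frame: must compute
out2, cache = mlp_with_replay(5.0, mlp, cache, tas=0.9)  # high TAS: cached output replayed
out3, cache = mlp_with_replay(5.0, mlp, cache, tas=0.1)  # low TAS: recomputed
```

The decision is made per frame (the paper applies it per module, guided by TAS), so frames with little temporal change skip the dominant MLP cost while frames with real motion are still computed in full.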