Resi-VidTok: An Efficient and Decomposed Progressive Tokenization Framework for Ultra-Low-Rate and Lightweight Video Transmission

📅 2025-10-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address real-time video transmission over extremely bandwidth-constrained wireless channels with weak connectivity (channel bandwidth ratios down to 0.0004), this paper proposes Resi-VidTok, a progressive 1D tokenization framework. The method reorganizes video into a discrete, importance-ordered token stream of key tokens and refinement tokens, and reconstructs video from incomplete token sets using a single shared frame-level decoder. Key technical contributions include: (i) differential temporal token coding; (ii) prefix-decodable reconstruction with graceful quality degradation; (iii) stride-controlled frame sparsification with a lightweight decoder-side interpolator; and (iv) channel-adaptive source-channel coding and modulation that allocates rate and protection according to token importance and channel condition. Evaluation shows reconstruction at over 30 fps while preserving visual and semantic consistency, which supports the framework's use in energy-efficient, latency-sensitive, and reliability-critical wireless applications.
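The idea of differential temporal coding with recovery from an incomplete token set can be illustrated with a toy sketch. This is not the paper's algorithm: the functions `diff_encode` and `diff_decode`, the token values, and the use of index order as a stand-in for importance order are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# One update = (token position, new token id). Hypothetical format.
Delta = List[Tuple[int, int]]


def diff_encode(prev: List[int], curr: List[int]) -> Delta:
    """Keep only the positions whose token id changed since the previous frame."""
    return [(i, c) for i, (p, c) in enumerate(zip(prev, curr)) if p != c]


def diff_decode(prev: List[int], delta: Delta,
                received: Optional[int] = None) -> List[int]:
    """Apply the first `received` updates (all of them if None).

    Positions whose update never arrived keep the previous frame's token,
    so any prefix of the update list still yields a decodable frame.
    """
    out = list(prev)
    for i, c in delta[:received]:
        out[i] = c
    return out


# Toy token ids for two consecutive frames (assumed values).
key_frame = [5, 9, 2, 7, 7, 1]
next_frame = [5, 8, 2, 7, 3, 1]

delta = diff_encode(key_frame, next_frame)           # two of six tokens changed
full = diff_decode(key_frame, delta)                 # every update arrived
partial = diff_decode(key_frame, delta, received=1)  # only the first update arrived
```

Here `full` equals `next_frame`, while `partial` differs from it in one position only, which is the kind of graceful degradation the summary describes.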

📝 Abstract

Real-time transmission of video over wireless networks remains highly challenging, even with advanced deep models, particularly under severe channel conditions such as limited bandwidth and weak connectivity. In this paper, we propose Resi-VidTok, a Resilient Tokenization-Enabled framework designed for ultra-low-rate and lightweight video transmission that delivers strong robustness while preserving perceptual and semantic fidelity on commodity digital hardware. By reorganizing spatio-temporal content into a discrete, importance-ordered token stream composed of key tokens and refinement tokens, Resi-VidTok enables progressive encoding, prefix-decodable reconstruction, and graceful quality degradation under constrained channels. A key contribution is a resilient 1D tokenization pipeline for video that integrates differential temporal token coding, explicitly supporting reliable recovery from incomplete token sets using a single shared framewise decoder, without auxiliary temporal extractors or heavy generative models. Furthermore, stride-controlled frame sparsification combined with a lightweight decoder-side interpolator reduces transmission load while maintaining motion continuity. Finally, a channel-adaptive source-channel coding and modulation scheme dynamically allocates rate and protection according to token importance and channel condition, yielding stable quality across adverse SNRs. Evaluation results indicate robust visual and semantic consistency at channel bandwidth ratios (CBR) as low as 0.0004 and real-time reconstruction at over 30 fps, demonstrating the practicality of Resi-VidTok for energy-efficient, latency-sensitive, and reliability-critical wireless applications.
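To give a sense of scale for a CBR of 0.0004, the sketch below computes a channel-symbol budget. It assumes the definition of CBR common in the joint source-channel coding literature (channel symbols divided by source dimensions, i.e. frames × height × width × color channels); the abstract does not state the paper's exact definition, and the clip size is a made-up example.

```python
def channel_symbol_budget(frames: int, height: int, width: int,
                          cbr: float, channels: int = 3) -> int:
    """Channel symbols available for a clip, assuming CBR = symbols / source dimensions."""
    source_dims = frames * height * width * channels
    return int(source_dims * cbr)


# Hypothetical clip: 16 RGB frames at 256x256.
clip_budget = channel_symbol_budget(frames=16, height=256, width=256, cbr=0.0004)
frame_budget = channel_symbol_budget(frames=1, height=256, width=256, cbr=0.0004)

print(clip_budget)   # 1258 symbols for about 3.1 million source values
print(frame_budget)  # 78 symbols per frame
```

Under these assumptions each frame gets fewer than 80 channel symbols, which is why the token stream has to be short and ordered by importance.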

Problem

Research questions and friction points this paper is trying to address.

Transmitting video at ultra-low rates under severe bandwidth constraints

Reconstructing video robustly from incomplete token sets

Maintaining real-time performance and semantic fidelity on lightweight hardware

Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive tokenization with importance-ordered key and refinement tokens

Resilient 1D tokenization pipeline using differential temporal token coding

Channel-adaptive source-channel coding with dynamic rate allocation
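As a rough illustration of how rate and protection might be allocated by token importance and channel condition, the sketch below picks a modulation order from the SNR, gives key tokens a lower code rate (more redundancy) than refinement tokens, and sends as many refinement tokens as fit in the remaining budget. The SNR thresholds, code rates, 12-bit token size, and the names `Plan` and `plan_transmission` are all hypothetical; the paper's actual scheme is not described on this page.

```python
import math
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class Plan:
    bits_per_symbol: int
    key_rate: Fraction
    refine_rate: Fraction
    refine_tokens_sent: int


def plan_transmission(snr_db: float, symbol_budget: int, n_key: int,
                      n_refine: int, bits_per_token: int = 12) -> Plan:
    # Hypothetical operating points: (bits per symbol, key code rate, refinement code rate).
    if snr_db < 5:
        bps, key_rate, refine_rate = 1, Fraction(1, 3), Fraction(1, 2)  # BPSK
    elif snr_db < 12:
        bps, key_rate, refine_rate = 2, Fraction(1, 2), Fraction(2, 3)  # QPSK
    else:
        bps, key_rate, refine_rate = 4, Fraction(1, 2), Fraction(3, 4)  # 16-QAM

    # Key tokens are sent first, with the stronger (lower-rate) code.
    key_symbols = math.ceil(Fraction(n_key * bits_per_token) / (bps * key_rate))
    remaining = max(symbol_budget - key_symbols, 0)

    # Refinement tokens fill whatever symbols are left, in importance order.
    fit = int(remaining * bps * refine_rate // bits_per_token)
    return Plan(bps, key_rate, refine_rate, min(fit, n_refine))


poor = plan_transmission(snr_db=3, symbol_budget=1258, n_key=16, n_refine=112)
good = plan_transmission(snr_db=15, symbol_budget=1258, n_key=16, n_refine=112)
```

With these assumed numbers, the poor channel carries the key tokens plus 28 refinement tokens, while the good channel carries all 112. The receiver decodes whatever prefix arrives in either case.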

🔎 Similar Papers

TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval