🤖 AI Summary
To address real-time video transmission over extremely bandwidth-constrained wireless channels (channel bandwidth ratios as low as 0.0004) with high packet loss, this paper proposes Resi-VidTok, an efficient progressive 1D tokenization framework. The method decomposes video into a resilient, importance-ordered spatiotemporal token stream of key tokens and refinement tokens, and reconstructs continuous video from incomplete token sets using only a single shared frame-level decoder. Key technical contributions include: (i) differential temporal token coding for motion coherence; (ii) prefix-decodable reconstruction; (iii) stride-controlled frame sparsification with lightweight decoder-side interpolation; and (iv) channel-adaptive source-channel coding and modulation. Evaluated under severe network conditions, the framework achieves real-time reconstruction at over 30 fps while preserving motion continuity and semantic consistency. The authors report improved robustness and energy efficiency over state-of-the-art approaches in weak-network scenarios.
📝 Abstract
Real-time transmission of video over wireless networks remains highly challenging, even with advanced deep models, particularly under severe channel conditions such as limited bandwidth and weak connectivity. In this paper, we propose Resi-VidTok, a Resilient Tokenization-Enabled framework designed for ultra-low-rate and lightweight video transmission that delivers strong robustness while preserving perceptual and semantic fidelity on commodity digital hardware. By reorganizing spatio-temporal content into a discrete, importance-ordered token stream composed of key tokens and refinement tokens, Resi-VidTok enables progressive encoding, prefix-decodable reconstruction, and graceful quality degradation under constrained channels. A key contribution is a resilient 1D tokenization pipeline for video that integrates differential temporal token coding, explicitly supporting reliable recovery from incomplete token sets using a single shared framewise decoder, without auxiliary temporal extractors or heavy generative models. Furthermore, stride-controlled frame sparsification combined with a lightweight decoder-side interpolator reduces transmission load while maintaining motion continuity. Finally, a channel-adaptive source-channel coding and modulation scheme dynamically allocates rate and protection according to token importance and channel condition, yielding stable quality across adverse SNRs. Evaluation results indicate robust visual and semantic consistency at channel bandwidth ratios (CBR) as low as 0.0004 and real-time reconstruction at over 30 fps, demonstrating the practicality of Resi-VidTok for energy-efficient, latency-sensitive, and reliability-critical wireless applications.
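As a rough illustration of the stride-controlled frame sparsification and decoder-side interpolation mentioned above, the following minimal sketch keeps every `stride`-th frame and fills the gaps at the receiver. It is not the Resi-VidTok implementation: the function names `sparsify_indices` and `interpolate_frames` are hypothetical, and plain linear blending stands in for the paper's lightweight interpolator, whose actual design is not described here.

```python
import numpy as np


def sparsify_indices(n_frames, stride):
    """Indices of the frames that would be transmitted (assumed rule: every stride-th frame)."""
    return list(range(0, n_frames, stride))


def interpolate_frames(kept_idx, kept_frames, n_frames):
    """Rebuild n_frames frames from the kept ones.

    Assumption: linear blending between the two nearest kept frames;
    frames after the last kept frame simply repeat it.
    """
    kept_idx = np.asarray(kept_idx)
    kept_frames = np.asarray(kept_frames, dtype=np.float64)
    out = np.empty((n_frames,) + kept_frames.shape[1:], dtype=np.float64)
    for t in range(n_frames):
        hi = int(np.searchsorted(kept_idx, t, side="left"))
        if hi < len(kept_idx) and kept_idx[hi] == t:
            out[t] = kept_frames[hi]   # frame was transmitted
        elif hi >= len(kept_idx):
            out[t] = kept_frames[-1]   # past the last kept frame: hold it
        else:
            lo = hi - 1
            w = (t - kept_idx[lo]) / (kept_idx[hi] - kept_idx[lo])
            out[t] = (1.0 - w) * kept_frames[lo] + w * kept_frames[hi]
    return out


# Toy clip: 8 one-pixel frames whose brightness ramps linearly with time.
clip = np.arange(8, dtype=np.float64).reshape(8, 1, 1)
kept = sparsify_indices(len(clip), stride=3)
restored = interpolate_frames(kept, clip[kept], len(clip))
print(kept, restored[:, 0, 0])
```

With a stride of 3, only three of the eight frames are sent. A larger stride lowers the transmission load further but leaves more motion for the interpolator to guess, which is the trade-off the stride parameter controls.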