🤖 AI Summary
This work addresses the challenge of efficiently synchronizing the discrete latent state of a generative driving world model over vehicular networks with severe bandwidth constraints and high packet loss. The authors propose a fully online, label-free adaptive streaming algorithm that prioritizes delta updates by cosine distance in the codebook embedding space and triggers keyframes dynamically using a Hamming-drift threshold, giving adaptive keyframe scheduling under a fixed payload budget. Using a stride-16 VQ-U-Net tokenizer and a keyframe-delta protocol, the method achieves a 7.2% reduction in dynamic-only embedding distortion and a 6.3% improvement in dynamic-position perplexity at a bitrate of 0.024 Mb/s relative to periodic keyframes. Under 10% delta packet loss at a 200-byte budget, it also outperforms a matched periodic baseline (distortion 0.0757 versus 0.0789), supporting the practicality and robustness of discrete token streams in vehicular communication environments.
📝 Abstract
Generative driving world models rely on compact latent state representations that must be efficiently transmitted and synchronized across distributed compute and connected vehicles. We study network-efficient streaming of a discrete world model state, where a stride-16 VQ-U-Net tokenizer (codebook size 8,192) maps each 288×512 frame to an 18×32 grid of token IDs (576 tokens/frame), equivalent to 936 bytes/frame under fixed-length coding. We consider a keyframe-delta protocol under strict per-message payload budgets and packet loss, and propose a fully online, label-free algorithm that prioritizes delta updates via cosine distance in codebook embedding space and triggers keyframes adaptively using a Hamming-drift threshold. The adaptive algorithm consistently improves the rate-distortion frontier over periodic keyframes at matched bitrates: at 0.024 Mb/s (200-byte budget), dynamic-only embedding distortion drops from 0.0712 to 0.0661 (7.2%), and at 0.036 Mb/s (400-byte budget), from 0.0427 to 0.0407 (4.8%). Under 10% delta packet loss at 200 bytes, dynamic-only distortion is 0.0757 versus 0.0789 for a matched periodic baseline. To connect state fidelity to world model usefulness, we train a lightweight next-token predictor and evaluate perplexity conditioned on streamed receiver states: at 0.024 Mb/s, dynamic-position perplexity improves from 206.0 to 193.1 (6.3%), and at 0.036 Mb/s, from 158.9 to 155.6 (2.1%). These results support discrete token-state streaming as a practical systems layer for bandwidth-aware synchronization and improved downstream token-dynamics utility under vehicular networking constraints.
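To make the per-frame arithmetic and the scheduling rule concrete, the following is a minimal Python sketch, not the authors' implementation. Only the codebook size (8,192), the grid size (576 tokens) and the 200-byte budget come from the abstract. The function names, the delta entry layout (one fixed-length position index plus one fixed-length token ID), the use of a normalized Hamming drift, and the assumption that the sender tracks the receiver's state exactly are all hypothetical choices made for illustration. Message headers and the way a keyframe is carried under the payload budget are not modeled.

```python
import numpy as np

CODEBOOK_SIZE = 8192   # from the abstract
GRID_TOKENS = 18 * 32  # 576 tokens per frame, from the abstract


def bits_for(n_values: int) -> int:
    """Bits needed for a fixed-length code over n_values symbols."""
    return max(1, (n_values - 1).bit_length())


def keyframe_bytes(n_tokens: int = GRID_TOKENS,
                   codebook_size: int = CODEBOOK_SIZE) -> int:
    """Size of a full token grid under fixed-length coding, rounded up to bytes."""
    return (n_tokens * bits_for(codebook_size) + 7) // 8


def plan_message(sender, receiver, codebook, budget_bytes, drift_threshold):
    """Choose between a keyframe and a prioritized delta (hypothetical helper).

    sender / receiver: integer token grids (current state vs. the sender's
    model of the receiver state). codebook: (K, D) embedding matrix.
    """
    sender = np.asarray(sender).ravel()
    receiver = np.asarray(receiver).ravel()

    # Hamming drift: fraction of positions whose token IDs differ (assumed normalization).
    changed = np.flatnonzero(sender != receiver)
    drift = changed.size / sender.size
    if drift > drift_threshold:
        return {"type": "key", "tokens": sender.copy()}

    # Cosine distance between the new and stale embeddings at each changed position.
    a = codebook[sender[changed]]
    b = codebook[receiver[changed]]
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12
    )
    dist = 1.0 - cos

    # Assumed entry layout: position index + token ID, both fixed-length.
    entry_bits = bits_for(sender.size) + bits_for(codebook.shape[0])
    max_entries = (budget_bytes * 8) // entry_bits

    # Send the largest embedding changes first, up to the payload budget.
    order = np.argsort(-dist, kind="stable")[:max_entries]
    keep = changed[order]
    return {"type": "delta", "positions": keep, "tokens": sender[keep]}
```

With 13 bits per token ID, `keyframe_bytes()` reproduces the 936 bytes/frame quoted above. Under the assumed 23-bit entry layout (10-bit position plus 13-bit token ID), a 200-byte budget fits at most 69 delta entries, which is why the ordering of updates matters at this bitrate.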