🤖 AI Summary
This work addresses the challenge of learning compact, temporally aware visual representations for dynamic scenes. We propose the Token Bottleneck framework, which compresses an entire dynamic scene frame into a single bottleneck token and reconstructs future frames in a self-supervised manner using only a few image patches as prompts. Crucially, temporal dependencies across frames are implicitly modeled during token compression and decompression, eliminating the need for explicit temporal modules or additional parameters. Built on standard vision backbones, the framework enables lightweight sequence modeling. Extensive evaluation demonstrates significant improvements over baselines on video label propagation and on both simulated and real-world robotic manipulation tasks. These results validate the framework's effectiveness in modeling dynamic scene evolution, its strong generalization, and its scalability across model sizes.
📝 Abstract
Deriving compact and temporally aware visual representations from dynamic scenes is essential for the successful execution of sequential scene understanding tasks such as visual tracking and robotic manipulation. In this paper, we introduce Token Bottleneck (ToBo), a simple yet intuitive self-supervised learning pipeline that squeezes a scene into a bottleneck token and predicts the subsequent scene using a minimal number of patches as hints. The ToBo pipeline facilitates the learning of sequential scene representations by conservatively encoding the reference scene into a compact bottleneck token during the squeeze step. In the expansion step, we guide the model to capture temporal dynamics by predicting the target scene from the bottleneck token together with a few target patches as hints. This design encourages the vision backbone to embed temporal dependencies, enabling it to understand dynamic transitions across scenes. Extensive experiments on diverse sequential tasks, including video label propagation and robot manipulation in simulated environments, demonstrate the superiority of ToBo over baselines. Moreover, deploying our pre-trained model on physical robots confirms its robustness and effectiveness in real-world environments. We further validate the scalability of ToBo across different model scales.
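The squeeze and expansion steps described above can be sketched in a toy form. The code below is an illustrative stand-in, not the paper's implementation: the mean-pooling "squeeze" and the blend-based "expand" are placeholder assumptions for the learned backbone-based compression and decoding, and all names, dimensions, and patch counts are invented for illustration.

```python
# Toy sketch of the ToBo squeeze/expand pipeline (illustrative assumptions only;
# the actual method uses a learned vision backbone, not these hand-written ops).
import random

DIM = 8           # toy patch-embedding dimension (assumed)
NUM_PATCHES = 16  # patches per scene (assumed)
NUM_HINTS = 2     # "few target patches" used as hints

def squeeze(reference_patches):
    """Squeeze step: compress the whole reference scene into a single
    bottleneck token. A plain mean over patch embeddings stands in for the
    learned compression."""
    n = len(reference_patches)
    return [sum(p[d] for p in reference_patches) / n for d in range(DIM)]

def expand(bottleneck_token, hint_patches):
    """Expansion step: predict the full target scene from the bottleneck
    token plus a few target patches as hints. A naive blend of the token and
    the mean hint stands in for the learned decoder."""
    m = len(hint_patches)
    hint_mean = [sum(p[d] for p in hint_patches) / m for d in range(DIM)]
    pred = [0.5 * bottleneck_token[d] + 0.5 * hint_mean[d] for d in range(DIM)]
    return [list(pred) for _ in range(NUM_PATCHES)]

random.seed(0)
reference = [[random.random() for _ in range(DIM)] for _ in range(NUM_PATCHES)]
target_hints = [[random.random() for _ in range(DIM)] for _ in range(NUM_HINTS)]

token = squeeze(reference)                 # one token summarizes the scene
predicted_target = expand(token, target_hints)  # full target scene predicted
```

In training, the prediction error against the true target scene would supervise the backbone, forcing temporal information to pass through the single bottleneck token.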