🤖 AI Summary
Existing video tokenization methods rely on fixed spatiotemporal blocks, producing redundant tokens and computational inefficiency, and they struggle to preserve representation fidelity under camera motion. To address this, we propose TrajViT, a trajectory-driven panoptic sub-object tokenization paradigm. TrajViT introduces the first perception-consistent, trajectory-aware tokenization, in which token count scales with semantic scene complexity rather than video duration. The method combines panoptic segmentation, multi-object trajectory tracking, a lightweight trajectory encoder, and contrastive learning to produce semantically aligned, temporally coherent video representations. Extensive experiments demonstrate consistent superiority over ViT3D across multiple tasks: +6% top-5 recall in video-text retrieval and +5.2% average accuracy on VideoQA. Moreover, TrajViT accelerates training by 4x, reduces inference FLOPs by 18x, and cuts token count by 10x.
📝 Abstract
Effective video tokenization is critical for scaling transformer models to long videos. Current approaches tokenize videos using fixed space-time patches, leading to excessive tokens and computational inefficiency. The best token-reduction strategies degrade performance and barely reduce the token count when the camera moves. We introduce grounded video tokenization, a paradigm that organizes tokens around panoptic sub-object trajectories rather than fixed patches. Our method aligns with fundamental perceptual principles, ensuring that tokenization reflects scene complexity rather than video duration. We propose TrajViT, a video encoder that extracts object trajectories and converts them into semantically meaningful tokens, significantly reducing redundancy while maintaining temporal coherence. Trained with contrastive learning, TrajViT significantly outperforms the space-time ViT (ViT3D) across multiple video understanding benchmarks; for example, TrajViT exceeds ViT3D by a large margin of 6% average top-5 recall on video-text retrieval with a 10x token reduction. We also show that TrajViT is a stronger video encoder than ViT3D for modern VideoLLMs, obtaining an average 5.2% performance improvement across 6 VideoQA benchmarks while training 4x faster and requiring 18x fewer inference FLOPs. TrajViT is the first efficient encoder to consistently outperform ViT3D across diverse video analysis tasks, making it a robust and scalable solution.
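To make the tokenization idea concrete: instead of one token per fixed space-time patch (T×H×W patches for ViT3D), grounded tokenization emits one token per tracked sub-object trajectory, so the token count follows the number of distinct scene elements. The sketch below is a simplified illustration, not the paper's actual pipeline: it assumes trajectory masks are already available (the paper uses panoptic segmentation plus multi-object tracking) and uses mean pooling where TrajViT uses a learned trajectory encoder.

```python
import numpy as np

def tokenize_by_trajectory(frame_features, track_masks):
    """Pool dense per-frame features into one token per object trajectory.

    frame_features: (T, H, W, D) array of per-frame feature maps.
    track_masks:    (T, H, W) integer array; each entry is the trajectory ID
                    covering that location (0 = background / untracked).
    Returns {trajectory_id: (D,) token}, mean-pooled over every location the
    trajectory occupies across all frames (a stand-in for a learned encoder).
    """
    tokens = {}
    for tid in np.unique(track_masks):
        if tid == 0:  # skip background
            continue
        mask = track_masks == tid                 # (T, H, W) boolean support
        tokens[int(tid)] = frame_features[mask].mean(axis=0)
    return tokens
```

Note the scaling behavior this illustrates: the number of tokens equals the number of trajectories (scene complexity), independent of the number of frames T, whereas patch-based tokenization grows linearly with T.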