🤖 AI Summary
Existing video tokenization methods rely on fixed spatiotemporal blocks, producing redundant tokens and computational inefficiency, and they struggle to preserve representation fidelity under camera motion. To address this, we propose TrajViT, a trajectory-driven panoptic sub-object tokenization paradigm. TrajViT introduces the first perception-consistent, trajectory-aware tokenization, in which token count scales with semantic scene complexity rather than video duration. The method combines panoptic segmentation, multi-object trajectory tracking, a lightweight trajectory encoder, and contrastive learning to produce semantically aligned, temporally coherent video representations. Extensive experiments demonstrate consistent superiority over ViT3D across multiple tasks: +6% top-5 recall in video-text retrieval and +5.2% average accuracy on VideoQA. Moreover, TrajViT accelerates training by 4x, reduces inference FLOPs by 18x, and cuts token count by 10x.
📝 Abstract
Effective video tokenization is critical for scaling transformer models to long videos. Current approaches tokenize videos using fixed space-time patches, leading to excessive tokens and computational inefficiency. The best token-reduction strategies degrade performance and barely reduce the token count when the camera moves. We introduce grounded video tokenization, a paradigm that organizes tokens around panoptic sub-object trajectories rather than fixed patches. Our method aligns with fundamental perceptual principles, ensuring that tokenization reflects scene complexity rather than video duration. We propose TrajViT, a video encoder that extracts object trajectories and converts them into semantically meaningful tokens, significantly reducing redundancy while maintaining temporal coherence. Trained with contrastive learning, TrajViT significantly outperforms the space-time ViT (ViT3D) across multiple video understanding benchmarks; for example, TrajViT exceeds ViT3D by a large margin of 6% average top-5 recall on video-text retrieval with a 10x token reduction. We also show that TrajViT is a stronger video encoder than ViT3D for modern VideoLLMs, obtaining an average 5.2% performance improvement across 6 VideoQA benchmarks while training 4x faster and requiring 18x fewer inference FLOPs. TrajViT is the first efficient encoder to consistently outperform ViT3D across diverse video analysis tasks, making it a robust and scalable solution.
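To make the tokenization idea concrete: instead of one token per fixed space-time patch (T×H×W patches for ViT3D), grounded tokenization emits one token per tracked sub-object trajectory, so the token count follows the number of distinct scene elements. The sketch below is a simplified illustration, not the paper's actual pipeline: it assumes trajectory masks are already available (the paper uses panoptic segmentation plus multi-object tracking) and uses mean pooling where TrajViT uses a learned trajectory encoder.

```python
import numpy as np

def tokenize_by_trajectory(frame_features, track_masks):
    """Pool dense per-frame features into one token per object trajectory.

    frame_features: (T, H, W, D) array of per-frame feature maps.
    track_masks:    (T, H, W) integer array; each entry is the trajectory ID
                    covering that location (0 = background / untracked).
    Returns {trajectory_id: (D,) token}, mean-pooled over every location the
    trajectory occupies across all frames (a stand-in for a learned encoder).
    """
    tokens = {}
    for tid in np.unique(track_masks):
        if tid == 0:  # skip background
            continue
        mask = track_masks == tid                 # (T, H, W) boolean support
        tokens[int(tid)] = frame_features[mask].mean(axis=0)
    return tokens
```

Note the scaling behavior this illustrates: the number of tokens equals the number of trajectories (scene complexity), independent of the number of frames T, whereas patch-based tokenization grows linearly with T.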