CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation

📅 2026-04-16

🤖 AI Summary
This work addresses the challenge of modeling the interplay between appearance and motion cues in unsupervised video object segmentation by proposing cross-modal token modulation. Within a two-stream architecture, relation transformer blocks densely connect appearance and motion tokens, enabling efficient intra- and inter-modal information propagation. A token masking strategy improves learning efficiency without relying on increased model complexity. The authors report state-of-the-art performance across all public benchmarks for unsupervised video object segmentation, outperforming existing methods.

📝 Abstract
Recent advances in unsupervised video object segmentation have highlighted the potential of two-stream architectures that integrate appearance and motion cues. However, fully leveraging these complementary sources of information requires effectively modeling their interdependencies. In this paper, we introduce cross-modality token modulation, a novel approach designed to strengthen the interaction between appearance and motion cues. Our method establishes dense connections between tokens from each modality, enabling efficient intra-modal and inter-modal information propagation through relation transformer blocks. To improve learning efficiency, we incorporate a token masking strategy that addresses the limitations of relying solely on increased model complexity. Our approach achieves state-of-the-art performance across all public benchmarks, outperforming existing methods.
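
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea, under stated assumptions: appearance and motion tokens are concatenated into one sequence, a single attention step lets every token attend to tokens of both modalities (intra- and inter-modal propagation), and a random subset of tokens is optionally dropped from the keys and values as a stand-in for token masking. The function name `joint_attention`, the `keep_ratio` parameter, the single head, and the identity query/key/value projections are all hypothetical simplifications and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_attention(app_tokens, mot_tokens, keep_ratio=1.0, rng=None):
    """Single-head attention over concatenated appearance and motion tokens.

    app_tokens: array of shape (N_a, D), appearance-stream tokens.
    mot_tokens: array of shape (N_m, D), motion-stream tokens.
    keep_ratio: fraction of tokens kept as keys/values (assumed form of
        token masking; the paper's actual strategy may differ).
    Returns updated appearance tokens, updated motion tokens, and the
    attention weights of shape (N_a + N_m, n_keep).
    """
    rng = np.random.default_rng() if rng is None else rng

    # One joint sequence, so attention covers both modalities at once.
    tokens = np.concatenate([app_tokens, mot_tokens], axis=0)
    n, d = tokens.shape

    # Assumed masking: randomly keep a subset of tokens as keys/values.
    n_keep = max(1, int(round(keep_ratio * n)))
    keep = np.sort(rng.choice(n, size=n_keep, replace=False))
    keys = tokens[keep]

    # Scaled dot-product attention with identity projections (a real
    # block would use learned query/key/value projections).
    scores = tokens @ keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)

    # Residual update, then split back into the two streams.
    out = tokens + weights @ keys
    n_a = app_tokens.shape[0]
    return out[:n_a], out[n_a:], weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    app = rng.standard_normal((6, 8))   # 6 appearance tokens, dim 8
    mot = rng.standard_normal((4, 8))   # 4 motion tokens, dim 8
    new_app, new_mot, w = joint_attention(app, mot, keep_ratio=0.5, rng=rng)
    print(new_app.shape, new_mot.shape, w.shape)
```

With `keep_ratio` below 1, each query attends to fewer keys, which lowers the cost of the attention step; whether the paper's masking operates this way is an assumption of this sketch.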

Problem

Research questions and friction points this paper is trying to address.

unsupervised video object segmentation
cross-modal interaction
appearance and motion cues
token modulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Modal Token Modulation
Unsupervised Video Object Segmentation
Two-Stream Architecture
Relation Transformer
Token Masking