CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation

📅 2026-04-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of modeling the interplay between appearance and motion cues in unsupervised video object segmentation by proposing a cross-modal token modulation mechanism. Within a two-stream architecture, relation transformer blocks densely connect appearance and motion tokens, enabling efficient intra-modal and inter-modal information propagation. The method also introduces a token masking strategy that improves learning efficiency rather than relying solely on increased model complexity. The authors report state-of-the-art performance across all public benchmarks for unsupervised video object segmentation, outperforming existing methods.

📝 Abstract

Recent advances in unsupervised video object segmentation have highlighted the potential of two-stream architectures that integrate appearance and motion cues. However, fully leveraging these complementary sources of information requires effectively modeling their interdependencies. In this paper, we introduce cross-modality token modulation, a novel approach designed to strengthen the interaction between appearance and motion cues. Our method establishes dense connections between tokens from each modality, enabling efficient intra-modal and inter-modal information propagation through relation transformer blocks. To improve learning efficiency, we incorporate a token masking strategy that addresses the limitations of relying solely on increased model complexity. Our approach achieves state-of-the-art performance across all public benchmarks, outperforming existing methods.
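
To make the idea of dense intra-modal and inter-modal token interaction with token masking more concrete, here is a minimal sketch in NumPy. It is not the authors' relation transformer block: the function name joint_token_attention, the keep_ratio parameter, the single attention head, and the absence of learned projections are all assumptions made for illustration. Appearance and motion tokens are concatenated into one sequence, a random subset is kept, and plain self-attention over the kept tokens lets every token attend to tokens of both modalities.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_token_attention(app_tokens, mot_tokens, keep_ratio=1.0, rng=None):
    """Illustrative joint attention over appearance and motion tokens.

    Hypothetical simplification: single head, no learned query/key/value
    projections, random masking. The paper's actual block may differ.

    app_tokens: (Na, D) appearance tokens
    mot_tokens: (Nm, D) motion tokens
    keep_ratio: fraction of tokens that take part in attention
    """
    rng = np.random.default_rng() if rng is None else rng

    # One joint sequence, so attention covers both intra-modal and
    # inter-modal token pairs.
    tokens = np.concatenate([app_tokens, mot_tokens], axis=0)
    n, d = tokens.shape

    # Token masking: only a random subset of tokens is processed.
    n_keep = max(1, int(round(keep_ratio * n)))
    keep = np.sort(rng.choice(n, size=n_keep, replace=False))
    kept = tokens[keep]

    # Scaled dot-product attention among the kept tokens.
    attn = softmax(kept @ kept.T / np.sqrt(d), axis=-1)

    # Residual update for kept tokens; masked tokens pass through unchanged.
    updated = tokens.copy()
    updated[keep] = kept + attn @ kept

    n_app = app_tokens.shape[0]
    return updated[:n_app], updated[n_app:], keep
```

With keep_ratio=1.0 every token attends to every other token in both streams; lowering it reduces the attention cost, which scales quadratically with the number of kept tokens. Whether the paper masks tokens randomly, and at which stage, is not stated in the abstract.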

Problem

Research questions and friction points this paper is trying to address.

unsupervised video object segmentation

cross-modal interaction

appearance and motion cues

token modulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Modal Token Modulation

Unsupervised Video Object Segmentation

Two-Stream Architecture

Relation Transformer