Tracking and Segmenting Anything in Any Modality

📅 2025-11-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video understanding methods often rely on modality-specific architectures or parameters, which prevents simultaneous cross-modal generalization and multi-task synergy and overlooks modality distribution shifts and task representation heterogeneity. To address this, the paper proposes SATA, a universal framework for video object tracking and segmentation with any modality input. A Decoupled Mixture-of-Expert (DeMoE) mechanism explicitly separates cross-modal shared knowledge from modality-specific information, while a Task-aware Multi-object Tracking (TaMOT) pipeline unifies all task outputs into a single set of instances with calibrated ID information, mitigating the degradation of task-specific knowledge during multi-task training. Evaluated on 18 mainstream benchmarks, SATA achieves state-of-the-art performance, significantly improving cross-modal transferability and multi-task generalization.

📝 Abstract
Tracking and segmentation play essential roles in video understanding, providing basic positional information and temporal association of objects within video sequences. Despite their shared objective, existing approaches often tackle these tasks using specialized architectures or modality-specific parameters, limiting their generalization and scalability. Recent efforts have attempted to unify multiple tracking and segmentation subtasks from the perspectives of any modality input or multi-task inference. However, these approaches tend to overlook two critical challenges: the distributional gap across different modalities and the feature representation gap across tasks. These issues hinder effective cross-task and cross-modal knowledge sharing, ultimately constraining the development of a true generalist model. To address these limitations, we propose a universal tracking and segmentation framework named SATA, which unifies a broad spectrum of tracking and segmentation subtasks with any modality input. Specifically, a Decoupled Mixture-of-Expert (DeMoE) mechanism is presented to decouple the unified representation learning task into the modeling process of cross-modal shared knowledge and specific information, thus enabling the model to maintain flexibility while enhancing generalization. Additionally, we introduce a Task-aware Multi-object Tracking (TaMOT) pipeline to unify all the task outputs as a unified set of instances with calibrated ID information, thereby alleviating the degradation of task-specific knowledge during multi-task training. SATA demonstrates superior performance on 18 challenging tracking and segmentation benchmarks, offering a novel perspective for more generalizable video understanding.
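The DeMoE idea described above can be illustrated with a minimal sketch: a single shared expert models cross-modal knowledge, while per-modality experts capture modality-specific cues, and the two paths are blended per input. This is a hypothetical NumPy illustration of the decoupling principle, not the paper's implementation; the class name, the fixed blend weight `alpha`, and the modality list are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class DecoupledMoE:
    """Hypothetical sketch of a decoupled Mixture-of-Experts:
    one shared expert for cross-modal knowledge, plus one
    expert per modality for modality-specific information."""

    def __init__(self, dim, modalities):
        # shared expert weight, applied regardless of modality
        self.shared = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        # one specific expert weight per input modality
        self.specific = {m: rng.standard_normal((dim, dim)) / np.sqrt(dim)
                         for m in modalities}

    def __call__(self, x, modality, alpha=0.5):
        # blend the shared transform with the modality-specific one
        return (1 - alpha) * x @ self.shared + alpha * x @ self.specific[modality]

moe = DecoupledMoE(dim=8, modalities=["rgb", "thermal", "depth"])
feat = rng.standard_normal((4, 8))   # 4 tokens with 8-dim features
out = moe(feat, "thermal")
print(out.shape)  # (4, 8)
```

In a real system the blend would typically be a learned, input-dependent gate rather than a fixed `alpha`, but the structural point is the same: shared parameters are always exercised, so cross-modal knowledge accumulates in one place, while specific experts absorb distribution shifts.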
Problem

Research questions and friction points this paper is trying to address.

Unifying tracking and segmentation across different input modalities
Addressing distribution gaps between modalities and feature representation gaps across tasks
Enabling effective cross-task and cross-modal knowledge sharing for generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled Mixture-of-Expert mechanism for cross-modal learning
Task-aware Multi-object Tracking pipeline unifying instance outputs
Universal framework handling multiple tracking and segmentation tasks
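The second innovation, unifying task outputs as one instance set with calibrated IDs, can be sketched in a few lines: each task's local instance IDs are remapped into a disjoint global ID space so that multi-task training consumes a single unified set. This is an assumed toy illustration of ID calibration; the function name and instance fields are invented for the example and do not come from the paper.

```python
def calibrate_ids(task_outputs):
    """Hypothetical sketch: remap each task's local instance IDs
    into one disjoint global ID space, yielding a single unified
    instance set for multi-task training."""
    unified, next_id, remap = [], 0, {}
    for task, instances in task_outputs.items():
        for inst in instances:
            key = (task, inst["id"])          # local ID, scoped to its task
            if key not in remap:
                remap[key] = next_id          # assign a fresh global ID
                next_id += 1
            unified.append({**inst, "task": task, "id": remap[key]})
    return unified

outputs = {
    "tracking": [{"id": 0, "box": [10, 10, 40, 40]}],
    "segmentation": [{"id": 0, "mask_area": 512}],
}
print([i["id"] for i in calibrate_ids(outputs)])  # [0, 1]
```

Note that both tasks used local ID 0, which would collide in naive joint training; after calibration each instance carries a unique global ID, which is the property the TaMOT pipeline relies on.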