SAM 2++: Tracking Anything at Any Granularity

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video trackers are typically task-specific, relying on customized modules that suffer from poor generalization and parameter redundancy. To address this, we propose the first unified framework for multi-granularity video tracking—supporting masks, bounding boxes, and points—built upon a task-adaptive prompt encoder and a unified decoder for consistent target representation and prediction across granularities. We further introduce a task-adaptive memory mechanism to enable unified cross-granularity feature matching. Additionally, we establish Tracking-Any-Granularity, the first large-scale benchmark supporting arbitrary-granularity tracking, and develop a multi-granularity data engine to generate high-quality annotations. Our framework achieves state-of-the-art performance across multiple benchmarks, significantly improving model generalization, robustness, and reusability—thereby establishing a new paradigm for generic video tracking.

📝 Abstract
Video tracking aims to locate a specified target in subsequent frames given its initial state. Because the granularity of the target state varies across tasks, most existing trackers are tailored to a single task and rely heavily on custom-designed modules, which limits their generalization and leads to redundancy in both model design and parameters. To unify video tracking tasks, we present SAM 2++, a unified model for tracking at any granularity, including masks, boxes, and points. First, to extend target granularity, we design task-specific prompts that encode the various task inputs into general prompt embeddings, and a unified decoder that converts the diverse task results into a unified pre-output form. Next, to support memory matching, the core operation of tracking, we introduce a task-adaptive memory mechanism that unifies memory across different granularities. Finally, we introduce a customized data engine to support tracking training at any granularity, producing a large and diverse video tracking dataset with rich annotations at three granularities, termed Tracking-Any-Granularity, which serves as a comprehensive resource for training and benchmarking unified tracking. Comprehensive experiments on multiple benchmarks confirm that SAM 2++ sets a new state of the art across diverse tracking tasks at different granularities, establishing a unified and robust tracking framework.
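The abstract describes encoding masks, boxes, and points into a shared prompt-embedding space. The paper page includes no code, so the following is only a rough illustrative sketch of that idea; every function name, the embedding width, and the encoding choices (Fourier features for points, corner averaging for boxes, pooled projection for masks) are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

EMBED_DIM = 256  # assumed prompt-embedding width


def encode_point(xy, rng):
    # Hypothetical: random Fourier positional features for a 2D point.
    freqs = rng.standard_normal((2, EMBED_DIM // 2))
    proj = np.asarray(xy, dtype=np.float64) @ freqs
    return np.concatenate([np.sin(proj), np.cos(proj)])


def encode_box(box, rng):
    # Hypothetical: treat a box as its two corner points and average them.
    x1, y1, x2, y2 = box
    return 0.5 * (encode_point((x1, y1), rng) + encode_point((x2, y2), rng))


def encode_mask(mask, rng):
    # Hypothetical: pool a binary mask to a scalar occupancy and project it.
    pooled = mask.astype(np.float64).mean()
    return pooled * rng.standard_normal(EMBED_DIM)


rng = np.random.default_rng(0)
prompts = [
    encode_point((120.0, 64.0), rng),            # point granularity
    encode_box((10.0, 20.0, 110.0, 90.0), rng),  # box granularity
    encode_mask(np.ones((32, 32), dtype=bool), rng),  # mask granularity
]
# All three granularities land in the same EMBED_DIM-dimensional prompt space,
# so one downstream decoder can consume any of them.
assert all(p.shape == (EMBED_DIM,) for p in prompts)
```

The design point being illustrated is that once every granularity maps to the same embedding shape, a single decoder and memory can be shared across tasks.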
Problem

Research questions and friction points this paper is trying to address.

Unifying video tracking tasks across different granularities
Eliminating redundancy in model design and parameters
Enabling memory matching across diverse target states
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified decoder converts diverse task outputs into a common representation
Task-adaptive memory mechanism for matching
Customized data engine supports multi-granularity training
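The innovations above center on memory matching as the core tracking operation. As a minimal sketch of what attention-style matching against a memory bank looks like (the function name, shapes, and scaled dot-product formulation are assumptions for illustration; the paper's task-adaptive mechanism is not specified on this page):

```python
import numpy as np


def memory_read(query, memory_keys, memory_values):
    """Hypothetical memory matching: attend over stored past-frame features.

    query:         (D,)   current-frame target feature
    memory_keys:   (N, D) keys from past frames (any granularity)
    memory_values: (N, D) values to aggregate into a target representation
    """
    # Scaled dot-product scores against every memory slot.
    scores = memory_keys @ query / np.sqrt(query.shape[0])
    # Numerically stable softmax over the memory slots.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Weighted readout: the matched target representation for this frame.
    return weights @ memory_values


rng = np.random.default_rng(1)
D, N = 64, 5  # assumed feature width and memory size
query = rng.standard_normal(D)
out = memory_read(query, rng.standard_normal((N, D)), rng.standard_normal((N, D)))
assert out.shape == (D,)
```

Because the readout depends only on the shared feature width, the same matching step can serve mask, box, and point tracks once their prompts are embedded uniformly.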
Jiaming Zhang
State Key Laboratory for Novel Software Technology, Nanjing University
Cheng Liang
Shanghai AI Lab
Yichun Yang
State Key Laboratory for Novel Software Technology, Nanjing University
Chenkai Zeng
State Key Laboratory for Novel Software Technology, Nanjing University
Yutao Cui
Tencent Hunyuan
Xinwen Zhang
State Key Laboratory for Novel Software Technology, Nanjing University
Xin Zhou
State Key Laboratory for Novel Software Technology, Nanjing University
Kai Ma
Platform and Content Group (PCG), Tencent
Gangshan Wu
State Key Laboratory for Novel Software Technology, Nanjing University
Limin Wang
OpenGVLab, Shanghai AI Laboratory