TinySAM 2: Extreme Memory Compression for Efficient Track Anything Model

📅 2026-05-18
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the high computational and memory demands of SAM 2, which stem from its multi-stage image encoder and memory module and hinder efficient deployment. The authors propose TinySAM 2, a streamlined variant that reduces computational and memory costs while retaining most of SAM 2's video segmentation performance. The approach combines a lightweight RepViT encoder, a memory quality management scheme that keeps only highly informative historical frames, and a joint spatial-temporal token compression strategy (spatial average pooling followed by similarity-based temporal token selection). TinySAM 2 achieves 90% of SAM 2.1's performance using only 7% of its memory tokens and 3% of its training data. Evaluations on benchmarks such as DAVIS and SA-V indicate that TinySAM 2 substantially lowers parameter count and computational load at the cost of a modest drop in segmentation accuracy.
📝 Abstract
Segment Anything Model 2 (SAM 2) is a core foundation model for video segmentation. Building on the original SAM, it introduces a memory bank mechanism and performs strongly on tasks such as semi-supervised video object segmentation and tracking anything. However, the computational complexity of SAM 2's multi-stage image encoder and memory module raises the barrier to deploying the model in practical applications. To address this issue, we propose TinySAM 2, a lightweight video segmentation model that balances performance and efficiency. First, a memory quality management mechanism is introduced to select and retain highly informative historical frames as the memory. Second, a joint spatial-temporal token compression scheme is proposed to reduce memory storage and computational cost: in the spatial domain, average pooling first compresses redundant tokens; in the temporal domain, informative tokens are selected across frames in the memory bank based on a token-level similarity measure. In addition, we adopt RepViT as the lightweight image encoder, which further reduces the number of model parameters. Extensive experiments on challenging datasets such as DAVIS and SA-V demonstrate that TinySAM 2 achieves 90% of the performance of SAM 2.1 with only 7% of the memory tokens and 3% of the training data. This study alleviates the bottlenecks in parameter count, computational load, and deployment cost associated with SAM 2, providing a resource-efficient solution for deploying video segmentation models on devices.
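To make the two compression steps in the abstract concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the pooling window, the similarity measure, or the selection rule. The sketch assumes a `k x k` average-pooling window, cosine similarity between each older-frame token and the token at the same position in the most recent frame, and a rule that keeps the least similar (least redundant) older tokens. The function names `spatial_pool` and `temporal_select` are hypothetical.

```python
import numpy as np


def spatial_pool(tokens, grid_hw, k=2):
    """Average-pool one frame's tokens over k x k spatial windows.

    tokens:  array of shape (H*W, C), row-major over the H x W grid.
    grid_hw: (H, W); both must be divisible by k (an assumption of this sketch).
    Returns an array of shape ((H//k)*(W//k), C).
    """
    h, w = grid_hw
    c = tokens.shape[-1]
    assert h % k == 0 and w % k == 0, "grid must be divisible by the window"
    # Split each spatial axis into (block index, offset inside block).
    blocks = tokens.reshape(h // k, k, w // k, k, c)
    return blocks.mean(axis=(1, 3)).reshape(-1, c)


def temporal_select(memory, keep):
    """Keep the reference frame plus the `keep` least-redundant older tokens.

    memory: array of shape (T, N, C); the last frame is used as the reference.
    Assumed criterion (not from the paper): an older token is redundant when
    its cosine similarity to the reference token at the same position is high.
    Returns (compressed_tokens, flat_indices_into_older_frames).
    """
    ref, older = memory[-1], memory[:-1]
    dot = (older * ref).sum(axis=-1)                        # (T-1, N)
    norms = np.linalg.norm(older, axis=-1) * np.linalg.norm(ref, axis=-1)
    sim = dot / (norms + 1e-8)                              # cosine similarity
    idx = np.argsort(sim.reshape(-1))[:keep]                # lowest similarity first
    selected = older.reshape(-1, memory.shape[-1])[idx]
    return np.concatenate([ref, selected], axis=0), idx
```

In this sketch, a memory bank of `T` frames with `N` pooled tokens each shrinks from `T * N` tokens to `N + keep`, so `keep` directly controls the memory-token budget. Other choices, such as comparing tokens across positions or scoring against all stored frames, would fit the abstract's description equally well.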
Problem

Research questions and friction points this paper is trying to address.

video segmentation
model deployment
memory compression
computational efficiency
resource-constrained devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

memory compression
joint spatial-temporal token pruning
lightweight video segmentation
memory bank optimization
RepViT encoder