TinySAM 2: Extreme Memory Compression for Efficient Track Anything Model

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high computational and memory demands of SAM 2, which stem from its multi-stage image encoder and memory module and hinder efficient deployment. The authors propose TinySAM 2, a streamlined variant that significantly reduces computational and memory costs while preserving most of SAM 2.1’s video segmentation performance. The approach integrates a lightweight RepViT encoder, a joint spatial-temporal token compression strategy with similarity-based temporal token selection, and a memory quality control scheme. TinySAM 2 achieves 90% of SAM 2.1’s performance using only 7% of its memory tokens and 3% of its training data. Evaluations on benchmarks such as DAVIS and SA-V indicate that TinySAM 2 substantially lowers parameter count and computational load at the cost of a modest drop in segmentation accuracy.
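The memory quality control idea can be illustrated with a minimal sketch. The summary does not say how frame quality is measured or how the memory is capped, so everything here is an assumption made for illustration: the class name `QualityGatedMemory`, the scalar `quality` score (for example, a predicted mask confidence), the `min_quality` threshold, and the fixed `capacity` are all hypothetical and are not taken from the paper.

```python
import heapq


class QualityGatedMemory:
    """Hypothetical memory bank that keeps only the highest-quality frames.

    Assumption: each frame comes with a scalar quality score; the paper's
    actual quality criterion is not described in this summary.
    """

    def __init__(self, capacity, min_quality):
        self.capacity = capacity          # maximum number of stored frames
        self.min_quality = min_quality    # frames below this score are rejected
        self._heap = []                   # min-heap of (quality, frame_idx, features)

    def offer(self, frame_idx, features, quality):
        """Try to store a frame; return True if it was admitted."""
        if quality < self.min_quality:
            return False
        entry = (quality, frame_idx, features)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
            return True
        # Memory is full: replace the weakest frame only if the new one is better.
        if quality > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)
            return True
        return False

    def frames(self):
        """Indices of the frames currently held, in temporal order."""
        return sorted(idx for _, idx, _ in self._heap)


memory = QualityGatedMemory(capacity=2, min_quality=0.5)
memory.offer(0, "feat0", 0.9)   # admitted
memory.offer(1, "feat1", 0.3)   # rejected: below the threshold
memory.offer(2, "feat2", 0.6)   # admitted, memory is now full
memory.offer(3, "feat3", 0.8)   # admitted, evicts frame 2
print(memory.frames())
```

A min-heap keeps the weakest stored frame at the root, so deciding whether a new frame deserves a slot is a single comparison.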

📝 Abstract

Segment Anything Model 2 (SAM 2) serves as a core foundation model in the field of video segmentation. Building upon the original SAM model, it introduces a memory bank mechanism and demonstrates outstanding performance in tasks such as semi-supervised video object segmentation and tracking anything. However, the computational complexity of SAM 2's multi-stage image encoder and memory module has raised the barrier to deploying the model in practical applications. To address this issue, we propose TinySAM 2, a lightweight video segmentation model that balances performance and efficiency. First, a memory quality management mechanism is introduced to select and retain highly informative historical frames as the memory. In addition, a joint spatial-temporal token compression is proposed to reduce memory storage and computational cost. Specifically, average pooling is first employed to compress redundant tokens in the spatial domain. In the temporal domain, informative tokens are selected across frames in the memory bank based on a token-level similarity measurement. Furthermore, we adopt RepViT as the lightweight image encoder, which further reduces the number of model parameters. Extensive experiments on challenging datasets such as DAVIS and SA-V demonstrate that TinySAM 2 achieves 90% of the performance of SAM 2.1 with only 7% of the memory tokens and 3% of the training data. This study alleviates the bottlenecks in parameter count, computational load, and deployment cost associated with SAM 2, providing a resource-efficient solution for the widespread on-device application of video segmentation models.
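The two compression stages described in the abstract can be sketched in plain Python under stated assumptions. The abstract confirms average pooling in the spatial domain and similarity-based selection in the temporal domain, but it does not give the pooling window, the similarity measure, or the selection rule. The 2x2 window, cosine similarity, and the rule "keep the tokens least similar to what is already in memory" are illustrative choices, and the function names are hypothetical.

```python
import math


def avg_pool_tokens(grid, k=2):
    """Average-pool an H x W grid of token vectors with a k x k window, stride k.

    Returns a flat list of (H // k) * (W // k) pooled tokens.
    """
    h, w, d = len(grid), len(grid[0]), len(grid[0][0])
    pooled = []
    for i in range(0, h - h % k, k):
        for j in range(0, w - w % k, k):
            window = [grid[i + di][j + dj] for di in range(k) for dj in range(k)]
            pooled.append([sum(t[c] for t in window) / len(window) for c in range(d)])
    return pooled


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def select_novel_tokens(candidates, kept, budget):
    """Keep the `budget` candidates least similar to tokens already in memory.

    Assumption: a token is 'informative' when its highest similarity to any
    stored token is low. The paper's actual criterion may differ.
    """
    def redundancy(tok):
        return max((cosine(tok, m) for m in kept), default=-1.0)

    return sorted(candidates, key=redundancy)[:budget]


# Spatial stage: a 2 x 2 grid of 2-dimensional tokens collapses to one token.
frame = [[[1.0, 0.0], [3.0, 0.0]],
         [[5.0, 0.0], [7.0, 0.0]]]
print(avg_pool_tokens(frame))

# Temporal stage: memory already holds a token pointing along the first axis.
memory_tokens = [[1.0, 0.0]]
new_tokens = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(select_novel_tokens(new_tokens, memory_tokens, budget=1))
```

In this sketch the spatial stage cuts the token count by a factor of k squared per frame, and the temporal stage then caps how many tokens each new frame contributes to the memory bank.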

Problem

Research questions and friction points this paper is trying to address.

video segmentation

model deployment

memory compression

computational efficiency

resource-constrained devices

Innovation

Methods, ideas, or system contributions that make the work stand out.

memory compression

joint spatial-temporal token pruning

lightweight video segmentation