🤖 AI Summary
To address SAM2's poor segmentation accuracy on camouflaged objects that blend seamlessly into their environment in video camouflaged object segmentation (VCOS), especially under simple prompts (e.g., points or bounding boxes), this paper proposes Camouflaged SAM2 (CamSAM2), a lightweight enhancement framework that leaves SAM2's parameters frozen. The method introduces three key components: (1) a *decamouflaged token* that provides flexible feature adjustment for camouflaged scenes; (2) *implicit object-aware fusion* (IOF) and *explicit object-aware fusion* (EOF) modules that exploit fine-grained, high-resolution features from the current frame and previous frames, respectively; and (3) an *object prototype generation* (OPG) mechanism that abstracts and memorizes object prototypes from high-quality features of previous frames. The framework is compatible with hierarchical vision backbones such as Hiera-T. Evaluated on three VCOS benchmarks, it substantially outperforms SAM2 while adding only negligible learnable parameters: +12.2 mDice with click prompts on MoCA-Mask and +19.6 mDice with mask prompts on SUN-SEG-Hard.
📝 Abstract
Video camouflaged object segmentation (VCOS), aiming at segmenting camouflaged objects that seamlessly blend into their environment, is a fundamental vision task with various real-world applications. With the release of SAM2, video segmentation has witnessed significant progress. However, SAM2's capability of segmenting camouflaged videos is suboptimal, especially when given simple prompts such as points and boxes. To address the problem, we propose Camouflaged SAM2 (CamSAM2), which enhances SAM2's ability to handle camouflaged scenes without modifying SAM2's parameters. Specifically, we introduce a decamouflaged token to provide the flexibility of feature adjustment for VCOS. To make full use of fine-grained and high-resolution features from the current frame and previous frames, we propose implicit object-aware fusion (IOF) and explicit object-aware fusion (EOF) modules, respectively. Object prototype generation (OPG) is introduced to abstract and memorize object prototypes with informative details using high-quality features from previous frames. Extensive experiments are conducted to validate the effectiveness of our approach. While CamSAM2 only adds negligible learnable parameters to SAM2, it substantially outperforms SAM2 on three VCOS datasets, especially achieving 12.2 mDice gains with the click prompt on MoCA-Mask and 19.6 mDice gains with the mask prompt on SUN-SEG-Hard, with Hiera-T as the backbone. The code will be available at [github.com/zhoustan/CamSAM2](https://github.com/zhoustan/CamSAM2).
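The abstract describes OPG as abstracting and memorizing object prototypes from high-quality features of previous frames. A common way to realize such a mechanism is masked average pooling of a frame's feature map under the predicted object mask, with a small FIFO memory of per-frame prototypes. The sketch below is illustrative only: the function and class names (`object_prototype`, `PrototypeMemory`), the memory capacity, and the averaging-based fusion are assumptions, not the paper's actual implementation.

```python
import numpy as np

def object_prototype(feat, mask, eps=1e-6):
    """Illustrative masked average pooling: abstract a (C,) prototype
    vector from a (C, H, W) feature map and an (H, W) soft object mask.
    This is a generic technique, not necessarily the paper's exact OPG."""
    w = mask[None]                       # (1, H, W) broadcastable weights
    num = (feat * w).sum(axis=(1, 2))    # (C,) mask-weighted feature sum
    den = w.sum() + eps                  # scalar effective mask area
    return num / den                     # (C,) object prototype

class PrototypeMemory:
    """Hypothetical FIFO bank keeping prototypes from recent frames."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.bank = []

    def update(self, proto):
        # Append the newest prototype; drop the oldest past capacity.
        self.bank.append(proto)
        if len(self.bank) > self.capacity:
            self.bank.pop(0)

    def read(self):
        # Fuse the memory by simple averaging (an illustrative choice;
        # attention-based readout would be another option).
        return np.mean(self.bank, axis=0)

# Toy usage: a constant feature map with a 2x2 masked region.
feat = np.ones((8, 4, 4))
mask = np.zeros((4, 4))
mask[:2, :2] = 1.0
proto = object_prototype(feat, mask)     # ~1.0 in every channel
memory = PrototypeMemory(capacity=2)
memory.update(proto)
memory.update(3.0 * proto)
fused = memory.read()                    # average of stored prototypes
```

Averaging under the mask keeps the prototype focused on object pixels, so background texture (the source of camouflage ambiguity) does not dilute the stored representation across frames.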