Scoring, Remember, and Reference: Catching Camouflaged Objects in Videos

📅 2025-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video Camouflaged Object Detection (VCOD) is challenging because camouflaged objects closely resemble their backgrounds and because existing methods make insufficient use of dynamic cues. To address these issues, we propose SRR, an end-to-end framework inspired by human memory-recognition mechanisms that follows a scoring-memory-reference paradigm. Specifically, we design a reference-guided multi-level asymmetric attention module that jointly models long-term reference frames and short-term motion features; introduce a dual-task decoder that predicts a mask and a confidence score simultaneously; adopt a lightweight single-pass inference architecture; and propose a learnable strategy for selecting memory reference frames. On mainstream benchmarks, SRR achieves a performance gain of approximately 10% over state-of-the-art methods with only 54M parameters.

📝 Abstract
Video Camouflaged Object Detection (VCOD) aims to segment objects whose appearances closely resemble their surroundings, which makes it a challenging and emerging task. Existing vision models often struggle in such scenarios because camouflaged objects are hard to distinguish from their backgrounds and because the dynamic information in videos is insufficiently exploited. To address these challenges, we propose an end-to-end VCOD framework inspired by human memory-recognition, which leverages historical video information by integrating memory reference frames for camouflaged sequence processing. Specifically, we design a dual-purpose decoder that simultaneously generates predicted masks and scores, enabling reference frame selection based on scores while introducing auxiliary supervision to enhance feature extraction. Furthermore, we introduce a novel reference-guided multilevel asymmetric attention mechanism that integrates long-term reference information with short-term motion cues for comprehensive feature extraction. By combining these modules, we develop the Scoring, Remember, and Reference (SRR) framework, which efficiently extracts information to locate targets and employs memory guidance to improve subsequent processing. With its optimized module design and effective use of video data, our model surpasses existing approaches by 10% on benchmark datasets while requiring fewer parameters (54M) and only a single pass through the video. The code will be made publicly available.
Problem

Research questions and friction points this paper is trying to address.

Detect camouflaged objects in videos with similar backgrounds
Improve dynamic information use in video object detection
Integrate memory-guided attention for better feature extraction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Memory reference frames for sequence processing
Dual-purpose decoder with mask and score generation
Reference-guided multilevel asymmetric attention mechanism
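
The scoring-memory-reference loop described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `Reference`, `run_single_pass`, and `toy_predict` are hypothetical, the predictor is a toy stand-in for the dual-purpose decoder, and the "keep the highest-scoring frame so far" rule is an assumed simplification of the score-based reference selection. It only shows how a single pass over the video can both produce masks and maintain a memory reference.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Frame = Sequence[float]  # stand-in for an image tensor
Mask = List[int]         # stand-in for a binary segmentation mask


@dataclass
class Reference:
    """Remembered frame: its index, predicted mask, and predicted score."""
    frame_index: int
    mask: Mask
    score: float


Predictor = Callable[[Frame, Optional[Reference]], Tuple[Mask, float]]


def run_single_pass(
    frames: Sequence[Frame], predict: Predictor
) -> Tuple[List[Mask], List[Optional[int]]]:
    """Process frames once, in order, reusing the best-scored frame as reference.

    Assumption: the reference is replaced whenever a frame's predicted
    score exceeds the stored one.
    """
    reference: Optional[Reference] = None
    masks: List[Mask] = []
    used_refs: List[Optional[int]] = []
    for i, frame in enumerate(frames):
        # Record which remembered frame guided this prediction.
        used_refs.append(reference.frame_index if reference is not None else None)
        # Dual-purpose output: a mask and a score for the same frame.
        mask, score = predict(frame, reference)
        masks.append(mask)
        # Scoring then remembering: keep the most confident frame so far.
        if reference is None or score > reference.score:
            reference = Reference(i, mask, score)
    return masks, used_refs


def toy_predict(frame: Frame, reference: Optional[Reference]) -> Tuple[Mask, float]:
    """Toy predictor: threshold at the mean; score is the value range.

    A real model would also condition on `reference`; this toy ignores it.
    """
    mean = sum(frame) / len(frame)
    mask = [1 if v > mean else 0 for v in frame]
    score = max(frame) - min(frame)
    return mask, score


if __name__ == "__main__":
    video = [[0.1, 0.2], [0.0, 0.9], [0.4, 0.5]]
    masks, used_refs = run_single_pass(video, toy_predict)
    print(masks)
    print(used_refs)
```

In this sketch each frame is visited exactly once, which mirrors the single-pass property claimed in the abstract; how the actual model conditions on the reference frame is not specified here.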
Yuang Feng
Fudan University, Shanghai, China
Shuyong Gao
Fudan University
Human Visual Attention, Generative Model, Weakly Supervised Learning
Fuzhen Yan
Fudan University, Shanghai, China
Yicheng Song
Fudan University, Shanghai, China
Lingyi Hong
Fudan University
Computer Vision
Junjie Hu
Fudan University, Shanghai, China
Wenqiang Zhang
Fudan University, Shanghai, China