Otter: Mitigating Background Distractions of Wide-Angle Few-Shot Action Recognition with Enhanced RWKV

📅 2025-11-10
🤖 AI Summary
To address severe background interference and degraded temporal modeling in few-shot action recognition (FSAR) for wide-angle videos, this paper proposes Otter, a framework built upon an enhanced RWKV backbone. Otter introduces: (1) a Compound Segmentation Module (CSM) that segments each frame and emphasizes key patches to suppress background noise; (2) a Temporal Reconstruction Module (TRM) that scans frames bidirectionally to reconstruct temporal relations between them; and (3) a dual-prototype collaborative architecture integrating visual prompting, local-global attention, and temporal-enhanced prototype construction, enabling joint optimization of subject-aware representation and temporal modeling. Otter achieves state-of-the-art performance on SSv2, Kinetics, UCF101, and HMDB51. It also outperforms existing methods on VideoBadminton, a challenging real-world wide-angle dataset.
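The idea of emphasizing key patches can be illustrated with a minimal sketch. It is not the paper's CSM: it assumes each frame is already split into patch feature vectors and that a saliency score per patch is given, whereas the summary does not say how Otter derives these. The function name `emphasize_patches` and the parameters `k` and `gain` are hypothetical.

```python
def emphasize_patches(patches, scores, k=2, gain=2.0):
    """Scale the k highest-scoring patches by `gain`; leave the rest unchanged.

    patches: list of per-patch feature vectors (lists of floats)
    scores:  one assumed saliency score per patch (higher = more subject-like)
    """
    # Indices of the k most salient patches.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:k])
    # Amplify salient patches so they dominate any later pooling step.
    return [
        [gain * v for v in patch] if i in top else list(patch)
        for i, patch in enumerate(patches)
    ]


# Toy frame with four patches; patches 1 and 3 are assumed to contain the subject.
patches = [[1.0, 1.0], [2.0, 0.0], [0.5, 0.5], [3.0, 1.0]]
scores = [0.1, 0.8, 0.05, 0.9]
emphasized = emphasize_patches(patches, scores)
```

In this toy setup the two background patches keep their values while the two high-score patches are doubled, which is one simple way for subject regions to outweigh background regions in a pooled frame feature.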

📝 Abstract
Wide-angle videos in few-shot action recognition (FSAR) effectively express actions within specific scenarios. However, without a global understanding of both subjects and background, recognizing actions in such samples remains challenging because of background distractions. Receptance Weighted Key Value (RWKV), which learns interactions between various dimensions, shows promise for global modeling. However, directly applying RWKV to wide-angle FSAR may fail to highlight subjects because of excessive background information. Additionally, temporal relations degraded by frames with similar backgrounds are difficult to reconstruct, further impacting performance. Therefore, we design the CompOund SegmenTation and Temporal REconstructing RWKV (Otter). Specifically, the Compound Segmentation Module (CSM) is devised to segment and emphasize key patches in each frame, effectively highlighting subjects against background information. The Temporal Reconstruction Module (TRM) is incorporated into the temporal-enhanced prototype construction to enable bidirectional scanning, allowing better reconstruction of temporal relations. Furthermore, a regular prototype is combined with the temporal-enhanced prototype to simultaneously enhance subject emphasis and temporal modeling, improving wide-angle FSAR performance. Extensive experiments on benchmarks such as SSv2, Kinetics, UCF101, and HMDB51 demonstrate that Otter achieves state-of-the-art performance. Extra evaluation on the VideoBadminton dataset further validates the superiority of Otter in wide-angle FSAR.
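Combining a regular prototype with a temporal-enhanced prototype can be sketched with generic prototype matching, as commonly used in few-shot classification. This is an illustration under stated assumptions, not Otter's implementation: each video is assumed to be already encoded as a pair of feature vectors (regular, temporal), class prototypes are plain means over the support shots, and the mixing weight `alpha` and the Euclidean distance are hypothetical choices the abstract does not specify.

```python
from math import sqrt


def mean_vector(vectors):
    """Element-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(query, support, alpha=0.5):
    """Return the label whose two prototypes are jointly closest to the query.

    query:   (regular_feature, temporal_feature)
    support: {label: [(regular_feature, temporal_feature), ...]}
    alpha:   assumed weight balancing the two prototype distances
    """
    distances = {}
    for label, shots in support.items():
        regular_proto = mean_vector([regular for regular, _ in shots])
        temporal_proto = mean_vector([temporal for _, temporal in shots])
        distances[label] = (
            alpha * euclidean(query[0], regular_proto)
            + (1 - alpha) * euclidean(query[1], temporal_proto)
        )
    return min(distances, key=distances.get)


# A 2-way 2-shot toy episode with made-up two-dimensional features.
support = {
    "smash": [([1.0, 0.0], [0.9, 0.1]), ([0.8, 0.2], [1.0, 0.0])],
    "serve": [([0.0, 1.0], [0.1, 0.9]), ([0.2, 0.8], [0.0, 1.0])],
}
query = ([0.9, 0.1], [0.8, 0.2])
predicted = classify(query, support)
```

Here the query lies close to both "smash" prototypes, so that label is selected. In a real system the two feature streams would come from learned encoders rather than hand-written vectors.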
Problem

Research questions and friction points this paper is trying to address.

Mitigating background distractions in wide-angle few-shot action recognition
Reconstructing temporal relations degraded by similar background frames
Enhancing subject emphasis and global modeling in action recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compound Segmentation Module emphasizes key patches
Temporal Reconstruction Module enables bidirectional scanning
Combines regular and temporal-enhanced prototypes for modeling
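Bidirectional scanning over frames can be sketched as follows. The sketch substitutes a plain exponentially decayed running sum for RWKV's actual recurrence, which is more involved and is not described in this summary; `scan`, `bidirectional_scan`, and the `decay` parameter are hypothetical names used only for illustration.

```python
def scan(frames, decay=0.5):
    """Run a decayed running sum over per-frame feature vectors, in order."""
    state = [0.0] * len(frames[0])
    outputs = []
    for frame in frames:
        # Each step mixes the previous state with the current frame.
        state = [decay * s + x for s, x in zip(state, frame)]
        outputs.append(state)
    return outputs


def bidirectional_scan(frames, decay=0.5):
    """Sum a forward scan and a backward scan, aligned per frame."""
    forward = scan(frames, decay)
    backward = scan(frames[::-1], decay)[::-1]
    return [
        [f + b for f, b in zip(fwd, bwd)]
        for fwd, bwd in zip(forward, backward)
    ]


# Three frames with one feature each; only the first frame carries signal.
frames = [[1.0], [0.0], [0.0]]
result = bidirectional_scan(frames)
```

With a forward scan alone, a frame only sees what came before it. Adding the backward pass lets every frame also receive context from later frames, which is the general motivation for scanning in both directions.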
Wenbo Huang
Southeast University | Institute of Science Tokyo
Video Analysis, Multimedia, Ubiquitous Computing
Jinghui Zhang
Southeast University, Nanjing 211189, Jiangsu, China
Zhenghao Chen
The University of Newcastle, Callaghan, NSW 2308, Australia
Guang Li
Assistant Professor, Hokkaido University
Dataset Distillation, Self-Supervised Learning, Data-Centric AI, Medical Image Analysis
Lei Zhang
Nanjing Normal University, Nanjing 210023, Jiangsu, China
Yang Cao
Institute of Science Tokyo, Tokyo 152-8550, Japan
Fang Dong
Southeast University
Edge Computing, Cloud, AIoT
Takahiro Ogawa
Hokkaido University
Multimedia Processing, AI, IoT, Big Data Analysis
M. Haseyama
Hokkaido University, Sapporo 060-0808, Hokkaido, Japan