Task-Specific Distance Correlation Matching for Few-Shot Action Recognition

📅 2025-12-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing few-shot action recognition (FSAR) methods suffer from two key limitations: (1) set matching based on cosine similarity captures only linear inter-frame dependencies, missing nonlinear ones, and neglects task-specific cues; (2) the side layers introduced for lightweight CLIP fine-tuning are difficult to optimize under few-shot conditions. To address these, we propose a task-specific distance correlation matching framework coupled with a Ladder Side Network (LSN) adaptation strategy. Specifically, we introduce a task-prototype-guided α-distance correlation metric that models both linear and nonlinear inter-frame relationships together with task-level cues. We further design a Guiding LSN mechanism, in which the learnable side layers are regularized by the adapted frozen CLIP to stabilize training in low-data regimes while preserving discriminability. Evaluated on five mainstream benchmarks, the method is reported to surpass state-of-the-art approaches in both 1-shot and 5-shot settings, with gains in accuracy and reduced memory overhead.
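
For reference, the α-distance correlation named above can be written out in its standard sample form (following Székely, Rizzo and Bakirov). This is the generic statistic only; how the paper applies it to frame features and combines it with the task prototype is not stated on this page, and the symbols below ($x_i$, $y_i$, $n$, $A$, $B$) are notation assumed here for illustration. Given paired samples $\{(x_i, y_i)\}_{i=1}^{n}$ and an exponent $0 < \alpha < 2$:

$$
a_{ij} = \lVert x_i - x_j \rVert^{\alpha}, \qquad
A_{ij} = a_{ij} - \bar{a}_{i\cdot} - \bar{a}_{\cdot j} + \bar{a}_{\cdot\cdot}
$$

$$
\mathrm{dCov}_n^2(X, Y) = \frac{1}{n^2} \sum_{i,j=1}^{n} A_{ij} B_{ij}, \qquad
\mathrm{dCor}_n^2(X, Y) = \frac{\mathrm{dCov}_n^2(X, Y)}{\sqrt{\mathrm{dCov}_n^2(X, X)\,\mathrm{dCov}_n^2(Y, Y)}}
$$

Here $B_{ij}$ is built from $b_{ij} = \lVert y_i - y_j \rVert^{\alpha}$ in the same way as $A_{ij}$, and bars denote row, column, and grand means of the distance matrix. In the population version, the statistic is zero exactly when $X$ and $Y$ are independent (for distributions with finite moments of order $\alpha$), which is why it responds to nonlinear as well as linear dependence.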

📝 Abstract
Few-shot action recognition (FSAR) has recently made notable progress through set matching and efficient adaptation of large-scale pre-trained models. However, two key limitations persist. First, existing set matching metrics typically rely on cosine similarity to measure inter-frame linear dependencies and then perform matching with only instance-level information, thus failing to capture more complex patterns such as nonlinear relationships and overlooking task-specific cues. Second, for efficient adaptation of CLIP to FSAR, recent work that performs fine-tuning via skip-fusion layers (which we refer to as side layers) has significantly reduced memory cost. However, the newly introduced side layers are often difficult to optimize under limited data conditions. To address these limitations, we propose TS-FSAR, a framework comprising three components: (1) a visual Ladder Side Network (LSN) for efficient CLIP fine-tuning; (2) a metric called Task-Specific Distance Correlation Matching (TS-DCM), which uses $\alpha$-distance correlation to model both linear and nonlinear inter-frame dependencies and leverages a task prototype to enable task-specific matching; and (3) a Guiding LSN with Adapted CLIP (GLAC) module, which regularizes LSN using the adapted frozen CLIP to improve training for better $\alpha$-distance correlation estimation under limited supervision. Extensive experiments on five widely used benchmarks demonstrate that our TS-FSAR yields superior performance compared to prior state-of-the-art methods.
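
The contrast the abstract draws between linear and nonlinear dependence can be illustrated with a small numerical sketch. The code below computes the generic sample α-distance correlation (the Székely-style statistic), not the paper's TS-DCM metric: the function names, the toy data, and the default `alpha=1.0` are assumptions made for illustration, and the task prototype and frame features are not modeled.

```python
import numpy as np


def _centered_alpha_distances(x, alpha):
    """Pairwise Euclidean distances raised to `alpha`, then double-centered."""
    diff = x[:, None, :] - x[None, :, :]
    d = np.linalg.norm(diff, axis=-1) ** alpha
    row_mean = d.mean(axis=1, keepdims=True)
    col_mean = d.mean(axis=0, keepdims=True)
    return d - row_mean - col_mean + d.mean()


def distance_correlation(x, y, alpha=1.0):
    """Sample alpha-distance correlation between paired samples x and y.

    x, y: arrays of shape (n,) or (n, d); rows are paired observations.
    alpha: distance exponent, assumed to lie in (0, 2).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    if y.ndim == 1:
        y = y[:, None]
    a = _centered_alpha_distances(x, alpha)
    b = _centered_alpha_distances(y, alpha)
    dcov2 = (a * b).mean()                      # squared distance covariance
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    if denom <= 0.0:                            # a constant sample has no variance
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / denom))


# Toy data (hypothetical): y depends on t only through a symmetric quadratic.
t = np.linspace(-1.0, 1.0, 201)
y_quadratic = t ** 2

pearson = float(np.corrcoef(t, y_quadratic)[0, 1])  # linear measure, close to zero
dcor_quadratic = distance_correlation(t, y_quadratic)
dcor_linear = distance_correlation(t, 2.0 * t + 1.0)
```

On this toy data the Pearson correlation is numerically zero because the dependence is purely nonlinear, while the distance correlation stays clearly positive; for an exactly linear relation the distance correlation equals 1.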
Problem

Research questions and friction points this paper is trying to address.

Addresses limitations in few-shot action recognition set matching
Improves efficient CLIP adaptation under limited data conditions
Enhances modeling of complex inter-frame dependencies for FSAR
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ladder Side Network for efficient CLIP fine-tuning
Task-Specific Distance Correlation Matching for nonlinear dependencies
Guiding LSN with Adapted CLIP module for regularization
Fei Long
School of Information and Communication Engineering, Dalian University of Technology
Yao Zhang
School of Information and Communication Engineering, Dalian University of Technology
Jiaming Lv
School of Information and Communication Engineering, Dalian University of Technology
Jiangtao Xie
School of Information and Communication Engineering, Dalian University of Technology
Peihua Li
Dalian University of Technology
Computer Vision, Deep Learning, Statistical Modeling