🤖 AI Summary
Few-shot action recognition (FSAR) suffers from sparse label semantics and the difficulty of modeling fine-grained action structure, such as actors, motion patterns, and interaction targets. To address this, we propose a language-guided action anatomy framework that (1) leverages large language models to parse action labels into three core semantic elements; (2) uses a visual anatomy module to decompose videos into semantically aligned atomic phases; and (3) establishes dual matching mechanisms, video-text and video-video, to enable cross-modal fine-grained alignment and prototype learning. Our approach is the first to deeply integrate the textual structural priors of actions with visual phase decomposition, significantly enhancing few-shot generalization. Extensive experiments demonstrate state-of-the-art performance across multiple FSAR benchmarks, validating both the effectiveness and robustness of our language-guided anatomy strategy.
📝 Abstract
Few-shot action recognition (FSAR) aims to classify human actions in videos with only a small number of labeled samples per category. The scarcity of training data has driven recent efforts to incorporate additional modalities, particularly text. However, the subtle variations in human posture, motion dynamics, and object interactions that occur across different phases of an action constitute inherent knowledge that action labels alone cannot fully capture. In this work, we propose Language-Guided Action Anatomy (LGA), a novel framework that goes beyond label semantics by leveraging Large Language Models (LLMs) to dissect the essential representational characteristics hidden beneath action labels. Guided by the prior knowledge encoded in LLMs, LGA effectively captures rich spatiotemporal cues in few-shot scenarios. Specifically, for text, we prompt an off-the-shelf LLM to anatomize labels into sequences of atomic action descriptions, focusing on the three core elements of action (subject, motion, object). For videos, a Visual Anatomy Module segments actions into atomic video phases to capture the sequential structure of actions. A fine-grained fusion strategy then integrates textual and visual features at the atomic level, resulting in more generalizable prototypes. Finally, we introduce a Multimodal Matching mechanism, comprising both video-video and video-text matching, to ensure robust few-shot classification. Experimental results demonstrate that LGA achieves state-of-the-art performance across multiple FSAR benchmarks.
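The dual matching mechanism described above can be sketched in a few lines. The sketch below is illustrative only: the feature dimensions, the cosine similarity metric, the per-phase averaging, and the `alpha` weight balancing the video-video and video-text branches are assumptions for the example, not details taken from the paper.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def dual_matching_score(query_phases, proto_phases, text_phases, alpha=0.5):
    """Combine video-video and video-text matching over atomic phases.

    query_phases: per-phase visual features of the query video
    proto_phases: per-phase visual prototype features of one class
    text_phases:  per-phase text features for that class (e.g. encoded
                  subject/motion/object descriptions produced by an LLM)
    alpha:        hypothetical weight balancing the two branches
    """
    vv = np.mean([cosine(q, p) for q, p in zip(query_phases, proto_phases)])
    vt = np.mean([cosine(q, t) for q, t in zip(query_phases, text_phases)])
    return alpha * vv + (1.0 - alpha) * vt

def classify(query_phases, support):
    # support maps class name -> (prototype phase features, text phase features).
    scores = {c: dual_matching_score(query_phases, p, t)
              for c, (p, t) in support.items()}
    return max(scores, key=scores.get)

# Toy usage with random phase features (3 atomic phases, 8-dim features).
rng = np.random.default_rng(0)
jump_feats = [rng.standard_normal(8) for _ in range(3)]
run_feats = [rng.standard_normal(8) for _ in range(3)]
support = {"jump": (jump_feats, jump_feats), "run": (run_feats, run_feats)}
print(classify(jump_feats, support))  # a query matching "jump" phases scores highest
```

In this toy setup a query whose phase features coincide with one class's prototypes receives the maximal combined score for that class; in practice both branches would operate on learned visual and text embeddings.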