Chain-of-Thought Textual Reasoning for Few-shot Temporal Action Localization

📅 2025-04-18
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
To address two limitations of existing few-shot temporal action localization (TAL) methods, namely the neglect of textual semantics and poor generalization to unseen action classes, this paper pioneers the integration of Chain-of-Thought (CoT) reasoning into TAL, proposing a text-driven cross-modal inference paradigm. The authors design a semantic-aware cross-modal alignment module that jointly leverages vision-language models (VLMs) and large language models (LLMs), enabling multi-granularity text-video alignment and CoT-guided textual reasoning to deepen semantic understanding of novel action categories. They further introduce the first few-shot TAL benchmark tailored to human anomaly detection. Extensive experiments on ActivityNet1.3 and THUMOS14 demonstrate state-of-the-art performance in both single- and multi-instance settings, significantly outperforming prior approaches. The code, pretrained models, and new benchmark are publicly released.
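The summary describes a two-stage text-generation pipeline: a VLM first captions the video, then an LLM is prompted to reason step by step about temporal order and causality. The paper's code is not reproduced here; the following is a minimal Python sketch of that idea, where `vlm_caption` and `llm_complete` are hypothetical placeholder callables (e.g. a BLIP-style captioner and a GPT-style text model), not the authors' actual API.

```python
from typing import Any, Callable, List

def generate_cot_description(
    frames: List[Any],
    vlm_caption: Callable[[Any], str],   # placeholder: image -> caption
    llm_complete: Callable[[str], str],  # placeholder: prompt -> completion
) -> str:
    """Produce a CoT-like textual description of a video clip."""
    # Stage 1: caption each sampled frame with the VLM.
    captions = [vlm_caption(frame) for frame in frames]

    # Stage 2: prompt the LLM to reason step by step over the captions,
    # making temporal order and causal structure explicit in the output.
    prompt = (
        "Frame captions in temporal order:\n"
        + "\n".join(f"{i}: {c}" for i, c in enumerate(captions))
        + "\n\nStep by step, describe which action occurs, how it unfolds "
        "over time, and what causes each stage. End with a one-sentence "
        "summary of the action."
    )
    return llm_complete(prompt)
```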

📝 Abstract
Traditional temporal action localization (TAL) methods rely on large amounts of densely annotated data, whereas few-shot TAL reduces this dependence by using only a few training samples to identify unseen action categories. However, existing few-shot TAL methods typically focus solely on video-level information and neglect textual information, which can provide valuable semantic support for the localization task. We therefore propose a new few-shot temporal action localization method that uses Chain-of-Thought textual reasoning to improve localization performance. Specifically, we design a novel few-shot learning framework that leverages textual semantic information to enhance the model's ability to capture action commonalities and variations; it includes a semantic-aware text-visual alignment module that aligns the query and support videos at different levels. Meanwhile, to better express the temporal dependencies and causal relationships between actions at the textual level and thereby assist localization, we design a Chain-of-Thought (CoT)-like reasoning method that progressively guides a Vision-Language Model (VLM) and a Large Language Model (LLM) to generate CoT-like text descriptions for videos; the generated texts capture more action variation than visual features alone. We conduct extensive experiments on the publicly available ActivityNet1.3 and THUMOS14 datasets, and we introduce the first Human-related Anomaly Localization dataset to explore the application of TAL to human anomaly detection. The experimental results demonstrate that our proposed method significantly outperforms existing methods in both single-instance and multi-instance scenarios. We will release our code, data, and benchmark.
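To make the alignment idea concrete, here is a minimal PyTorch sketch of snippet-level text-visual alignment under our own assumptions: pooled snippet features, a single embedding for the CoT text description, and an equal-weight fusion of visual and textual similarity. None of this is the paper's published implementation.

```python
import torch
import torch.nn.functional as F

def align_scores(
    query_feats: torch.Tensor,    # (Tq, D) snippet features of the query video
    support_feats: torch.Tensor,  # (Ts, D) snippet features of a support video
    text_emb: torch.Tensor,       # (D,)   embedding of the CoT text description
) -> torch.Tensor:
    """Per-snippet relevance scores for the query video."""
    q = F.normalize(query_feats, dim=-1)
    s = F.normalize(support_feats, dim=-1)
    t = F.normalize(text_emb, dim=-1)

    # Visual alignment: each query snippet vs. its best-matching support snippet.
    vis_sim = (q @ s.T).max(dim=-1).values  # (Tq,)
    # Textual alignment: each query snippet vs. the CoT text embedding.
    txt_sim = q @ t                          # (Tq,)
    # Equal-weight fusion (an assumption, not the paper's choice).
    return 0.5 * (vis_sim + txt_sim)
```

Snippets scoring above a threshold would then be grouped into candidate action intervals, which is one common way to turn such scores into localizations.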
Problem

Research questions and friction points this paper is trying to address.

Few-shot TAL lacks textual semantic support
Text and visual features must be aligned for action localization
Temporal reasoning needs strengthening via Chain-of-Thought text generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought textual reasoning for TAL
Semantic-aware text-visual alignment module
CoT-like reasoning with VLM and LLM
Hongwei Ji
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, China
Wu Yun
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, China
Mengshi Qi
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, China
Huadong Ma
Beijing University of Posts and Telecommunications (BUPT), China
Internet of Things · Multimedia