The Role of Video Generation in Enhancing Data-Limited Action Understanding

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the performance bottleneck in video action understanding caused by scarce labeled data, this paper proposes an automatic annotated-data generation framework based on a text-to-video diffusion transformer. The method eliminates reliance on human annotation and enables large-scale synthesis of high-quality training samples. Its core contributions are: (1) an information enhancement strategy, novel in this domain, that explicitly enriches the semantic content of generated videos along the scene and actor dimensions; and (2) an uncertainty-based adaptive label smoothing mechanism that mitigates the training instability induced by heterogeneous sample quality. Extensive experiments across four standard benchmarks and five downstream tasks validate its effectiveness. Notably, the approach achieves state-of-the-art zero-shot action recognition performance, demonstrating both scalability and generalization without requiring any ground-truth labels.
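The annotation-free idea can be sketched in a few lines: because the action label is embedded in the text prompt sent to the generator, every synthesized clip carries its label for free, and the scene/actor enhancement amounts to varying the prompt along those two dimensions. The vocabularies and prompt template below are illustrative assumptions, not the paper's actual implementation.

```python
import itertools
import random

# Hypothetical vocabularies for the two enhancement dimensions described in
# the paper (environments/scenes and characters/actors). The concrete word
# lists and the prompt template are assumptions for illustration only.
SCENES = ["in a park", "in a gym", "on a city street"]
ACTORS = ["a young woman", "an elderly man", "a child"]

def build_prompts(action_label, n_prompts=4, seed=0):
    """Turn one action class label into varied text-to-video prompts.

    Since the label is part of each prompt, every generated clip is
    automatically annotated with that label -- no human labeling needed.
    """
    rng = random.Random(seed)
    combos = list(itertools.product(ACTORS, SCENES))
    rng.shuffle(combos)
    return [
        (f"{actor} {action_label} {scene}", action_label)
        for actor, scene in combos[:n_prompts]
    ]

# Each (prompt, label) pair would then be fed to a text-to-video diffusion
# transformer; the label travels with the resulting clip.
pairs = build_prompts("playing guitar")
```

Varying actors and scenes per label is one plausible reading of the "environments and characters" enhancement; the paper's exact augmentation procedure may differ.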

📝 Abstract
Video action understanding tasks in real-world scenarios often suffer from data limitations. In this paper, we address the data-limited action understanding problem by bridging the gap caused by data scarcity. We propose a novel method that employs a text-to-video diffusion transformer to generate annotated data for model training. This paradigm enables the generation of realistic annotated data at an effectively unlimited scale without human intervention. We propose an information enhancement strategy and an uncertainty-based label smoothing scheme tailored to training on generated samples. Through quantitative and qualitative analysis, we observe that real samples generally contain richer information than generated samples. Based on this observation, the information enhancement strategy enriches the informative content of generated samples along two dimensions: the environments and the characters. Furthermore, we observe that some low-quality generated samples can negatively affect model training. To address this, we devise an uncertainty-based label smoothing strategy that applies stronger smoothing to such samples, thus reducing their impact. We demonstrate the effectiveness of the proposed method on four datasets across five tasks and achieve state-of-the-art performance for zero-shot action recognition.
Problem

Research questions and friction points this paper is trying to address.

Addressing data scarcity in video action understanding tasks
Generating realistic annotated data without human intervention
Improving model training with enhanced samples and label smoothing
Innovation

Methods, ideas, or system contributions that make the work stand out.

A text-to-video diffusion transformer generates annotated training data
An information enhancement strategy enriches sample content (environments and characters)
Uncertainty-based label smoothing reduces the impact of low-quality generated samples
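The uncertainty-based label smoothing idea above can be sketched as follows: scale the smoothing strength with the normalized predictive entropy of the model's output, so that noisy generated clips (high uncertainty) receive softer targets while clean clips keep near one-hot labels. This is a minimal sketch of the general technique; the paper's exact uncertainty measure and smoothing schedule may differ, and `eps_max` is an assumed hyperparameter.

```python
import math

def soft_label(probs, true_idx, num_classes, eps_max=0.2):
    """Uncertainty-adaptive label smoothing (a sketch, not the paper's exact rule).

    probs: the model's predicted class probabilities for one sample.
    Returns a soft target whose smoothing strength eps grows with the
    normalized predictive entropy of `probs`, so uncertain (likely
    low-quality) generated samples contribute weaker gradients.
    """
    entropy = -sum(p * math.log(p + 1e-12) for p in probs)
    norm_entropy = entropy / math.log(num_classes)  # in [0, 1]
    eps = eps_max * norm_entropy
    # Standard label-smoothing target with a per-sample eps.
    target = [eps / (num_classes - 1)] * num_classes
    target[true_idx] = 1.0 - eps
    return target

# Confident prediction -> almost one-hot target
confident = soft_label([0.97, 0.01, 0.01, 0.01], true_idx=0, num_classes=4)
# Maximally uncertain prediction -> fully smoothed target (eps = eps_max)
uncertain = soft_label([0.25, 0.25, 0.25, 0.25], true_idx=0, num_classes=4)
```

The per-sample `eps` replaces the single global smoothing constant of standard label smoothing, which is what lets low-quality synthetic clips be down-weighted without discarding them.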