🤖 AI Summary
We address zero-shot fine-grained video classification—i.e., classifying videos into unseen action categories without any video samples or temporal annotations for those categories. Our method introduces a language-guided sequence alignment paradigm: leveraging large language models to generate ordered sub-action sequences as structured linguistic priors for each action class; extracting frame-level visual embeddings via SigLIP; and aligning sub-action sequences with video frame sequences in a shared cross-modal embedding space using Dynamic Time Warping (DTW). This is the first approach to explicitly integrate interpretable action-structure priors with classical sequence alignment, requiring neither video-text supervision nor model fine-tuning or additional training. Evaluated on the highly challenging ActionAtlas benchmark, where human accuracy is only 61.6%, our method achieves 30.5% top-1 accuracy and significantly outperforms billion-parameter video-language models while using approximately 8× fewer parameters.
📝 Abstract
We address the task of zero-shot fine-grained video classification, where no video examples or temporal annotations are available for unseen action classes. While contrastive vision-language models such as SigLIP demonstrate strong open-set recognition via mean-pooled image-text similarity, they fail to capture the temporal structure critical for distinguishing fine-grained activities. We introduce ActAlign, a zero-shot framework that formulates video classification as sequence alignment. For each class, a large language model generates an ordered sub-action sequence, which is aligned with video frames using Dynamic Time Warping (DTW) in a shared embedding space. Without any video-text supervision or fine-tuning, ActAlign achieves 30.5% accuracy on the extremely challenging ActionAtlas benchmark, where human accuracy is only 61.6%. ActAlign outperforms billion-parameter video-language models while using approximately 8× fewer parameters. These results demonstrate that structured language priors, combined with classical alignment techniques, offer a scalable and general approach to unlocking the open-set recognition potential of vision-language models for fine-grained video understanding.
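To make the alignment step concrete, here is a minimal sketch of DTW-based zero-shot classification over pre-computed embeddings. It assumes each class's LLM-generated sub-actions and each video's frames have already been embedded into the same space (e.g., by SigLIP's text and image encoders); the cosine-distance cost and the path-length normalization are illustrative choices, not necessarily the paper's exact formulation.

```python
import numpy as np

def dtw_align_score(sub_action_embs: np.ndarray, frame_embs: np.ndarray) -> float:
    """Align an ordered sub-action sequence (K x D) to video frames (T x D)
    via Dynamic Time Warping over cosine distances.
    Returns a normalized alignment cost (lower = better match)."""
    # Cosine distance between every sub-action and every frame.
    a = sub_action_embs / np.linalg.norm(sub_action_embs, axis=1, keepdims=True)
    b = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T  # shape (K, T), each entry in [0, 2]

    K, T = cost.shape
    D = np.full((K + 1, T + 1), np.inf)
    D[0, 0] = 0.0
    # Standard DTW recurrence: each cell extends the cheapest of the
    # three admissible predecessors (match, skip-frame, skip-sub-action).
    for i in range(1, K + 1):
        for j in range(1, T + 1):
            D[i, j] = cost[i - 1, j - 1] + min(
                D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]
            )
    # Normalize by a path-length bound so scores are comparable
    # across classes with different numbers of sub-actions.
    return float(D[K, T] / (K + T))

def classify(frame_embs: np.ndarray, class_subactions: dict) -> str:
    """Pick the class whose sub-action sequence aligns best with the video."""
    scores = {c: dtw_align_score(e, frame_embs) for c, e in class_subactions.items()}
    return min(scores, key=scores.get)
```

Because DTW permits monotonic many-to-one matching, a sub-action may span several frames, which is what lets an ordered textual script absorb variable-speed executions of the same action.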