🤖 AI Summary
Image-pretrained models struggle to capture mid-temporal-frequency motion when transferred to video tasks: they tend to model only static features or extremely rapid changes, which limits their performance on fine-grained action recognition. This work proposes the first frequency-aware adapter for parameter-efficient fine-tuning, integrating spectral analysis into the adaptation process. The method applies a fast Fourier transform along the temporal dimension to decompose input features into frequency components, then introduces learnable frequency-band-specific embeddings that adaptively enhance the most discriminative spectral features. Evaluated on five fine-grained action recognition benchmarks, the proposed approach consistently outperforms existing parameter-efficient fine-tuning methods and surpasses full-model fine-tuning on four of them, demonstrating the value of temporal frequency information for video understanding.
📝 Abstract
Adapting image-pretrained backbones to video typically relies on time-domain adapters tuned to a single temporal scale. Our experiments show that these modules pick up static image cues and very rapid, flicker-like changes while overlooking medium-speed motion. Capturing dynamics across multiple time scales is, however, crucial for fine-grained temporal analysis (e.g., opening vs. closing a bottle).
To address this, we introduce Frame2Freq -- a family of frequency-aware adapters that perform spectral encoding during image-to-video adaptation of pretrained Vision Foundation Models (VFMs), improving fine-grained action recognition. Frame2Freq applies a Fast Fourier Transform (FFT) along the temporal dimension and learns frequency-band-specific embeddings that adaptively highlight the most discriminative frequency ranges. Across five fine-grained activity recognition datasets, Frame2Freq outperforms prior parameter-efficient fine-tuning (PEFT) methods and even surpasses fully fine-tuned models on four of them. These results provide encouraging evidence that frequency-analysis methods are a powerful tool for modeling temporal dynamics in image-to-video transfer. Code is available at https://github.com/th-nesh/Frame2Freq.
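The spectral-encoding idea described above (temporal FFT, per-band reweighting, return to the time domain) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name, the per-bin gain parameterization, and the feature shapes are all assumptions for exposition.

```python
import numpy as np

def frequency_aware_adapter(x, band_weights):
    """Illustrative sketch of a frequency-aware adaptation step (hypothetical).

    x: features of shape (T, D) -- T frames, D channels.
    band_weights: per-frequency-bin gains, shape (T // 2 + 1,).
        In training, these would be learnable, letting the adapter
        amplify discriminative bands (e.g., medium-speed motion).
    """
    # Decompose the feature sequence along time into frequency components.
    spec = np.fft.rfft(x, axis=0)            # complex, shape (T // 2 + 1, D)
    # Reweight each temporal-frequency bin.
    spec = spec * band_weights[:, None]
    # Map back to the time domain for the rest of the network.
    return np.fft.irfft(spec, n=x.shape[0], axis=0)

# Toy usage: 8 frames, 4 channels. With identity gains the
# rfft/irfft round trip reproduces the input features.
T, D = 8, 4
x = np.random.randn(T, D)
y = frequency_aware_adapter(x, np.ones(T // 2 + 1))
```

A practical adapter would learn coarser band embeddings (rather than one gain per bin) and combine this spectral branch with the usual bottleneck projections, but the FFT-reweight-inverse-FFT pattern above is the core operation.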