🤖 AI Summary
Existing video classification methods naively average frame-level embeddings from pretrained Transformer encoders, neglecting critical temporal structure—including event ordering, dynamic feature importance, and duration variability. Mainstream temporal modeling approaches require architectural modifications and full retraining, making them incompatible with already fine-tuned large models. This paper proposes an encoder-agnostic, lightweight temporal matching framework: it converts each video into a variable-length multivariate time series of frame embeddings and introduces a learnable per-frame, per-feature weighting mechanism. Inspired by time-series alignment algorithms, the framework employs a dedicated neural architecture for temporal modeling. It adds fewer than 1.8% additional parameters and requires no encoder modification or retraining. Evaluated on Something-Something V2, Kinetics-400, and HMDB51, the method achieves 77.2%, 89.1%, and 88.6% Top-1 accuracy, respectively, with training completed in under three hours.
📝 Abstract
In recent years, large transformer-based video encoder models have greatly advanced state-of-the-art performance on video classification tasks. However, these large models typically process videos by averaging embedding outputs from multiple clips over time to produce fixed-length representations. This approach fails to account for a variety of time-related features, such as variable video durations, chronological order of events, and temporal variance in feature significance. While methods for temporal modeling do exist, they often require significant architectural changes and expensive retraining, making them impractical for off-the-shelf, fine-tuned large encoders. To overcome these limitations, we propose DejaVid, an encoder-agnostic method that enhances model performance without the need for retraining or altering the architecture. Our framework converts a video into a variable-length temporal sequence of embeddings, which we call a multivariate time series (MTS). An MTS naturally preserves temporal order and accommodates variable video durations. We then learn per-timestep, per-feature weights over the encoded MTS frames, allowing us to account for variations in feature importance over time. We introduce a new neural network architecture inspired by traditional time series alignment algorithms for this learning task. Our evaluation demonstrates that DejaVid substantially improves the performance of a state-of-the-art large encoder, achieving leading Top-1 accuracy of 77.2% on Something-Something V2, 89.1% on Kinetics-400, and 88.6% on HMDB51, while adding fewer than 1.8% additional learnable parameters and requiring less than 3 hours of training time. Our code is available at https://github.com/darrylho/DejaVid.
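To make the core idea concrete, the sketch below shows one plausible reading of the matching step: a video's multivariate time series (T frames × D embedding features) is aligned to a learned class prototype with a DTW-style dynamic program, where a per-timestep, per-feature weight matrix scales each feature's contribution to the alignment cost. The function and variable names are illustrative assumptions, not taken from the DejaVid codebase.

```python
import numpy as np

def weighted_dtw_distance(mts, prototype, weights):
    """Hypothetical sketch of a weighted time-series alignment cost.

    mts:       (T, D) frame embeddings for one video
    prototype: (Tp, D) learned class prototype sequence
    weights:   (Tp, D) learnable per-timestep, per-feature weights
    """
    T, Tp = len(mts), len(prototype)
    # Pairwise cost: weighted squared distance between frame t and prototype step s.
    cost = np.array([[np.sum(weights[s] * (mts[t] - prototype[s]) ** 2)
                      for s in range(Tp)] for t in range(T)])
    # Standard DTW dynamic program over the cost matrix;
    # variable-length videos are handled naturally by the alignment.
    acc = np.full((T + 1, Tp + 1), np.inf)
    acc[0, 0] = 0.0
    for t in range(1, T + 1):
        for s in range(1, Tp + 1):
            acc[t, s] = cost[t - 1, s - 1] + min(acc[t - 1, s],
                                                 acc[t, s - 1],
                                                 acc[t - 1, s - 1])
    return acc[T, Tp]
```

Classification would then pick the prototype with the lowest alignment cost; in the actual method the weights and prototypes are learned end-to-end, which this toy version omits.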