Swift Sampling: Selecting Temporal Surprises via Taylor Series

📅 2026-05-21

🤖 AI Summary

This work addresses frame redundancy in long videos by proposing a lightweight, training-free frame sampling method inspired by the brain’s predictive coding mechanism, requiring neither auxiliary networks nor video-specific hyperparameter tuning. The approach models a video as a differentiable trajectory in a visual latent space, computes the velocity and acceleration of the features along it, and uses a Taylor expansion to predict where subsequent frames should fall. Sampling selects “temporal surprises”: frames that deviate sharply from the predicted path. The method adds only 0.02× additional computational cost over the baseline, an overhead 30× cheaper than that of leading baselines. Across three long-form video question-answering benchmarks and ten downstream tasks, it outperforms uniform sampling and prior query-agnostic baselines, improving accuracy by up to 12.5 points under constrained frame budgets.
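
One way to read the prediction step is as a second-order Taylor expansion with a unit time step between frames. The listing does not give the paper's exact formulation, so the finite-difference estimates of velocity and acceleration, the symbols f_t, v_t, a_t and s_{t+1}, and the Euclidean norm below are assumptions made for illustration only.

```latex
\begin{align}
  v_t &= f_t - f_{t-1}
      && \text{(velocity, assumed backward difference)} \\
  a_t &= v_t - v_{t-1} = f_t - 2 f_{t-1} + f_{t-2}
      && \text{(acceleration)} \\
  \hat{f}_{t+1} &= f_t + v_t + \tfrac{1}{2} a_t
      && \text{(second-order Taylor prediction)} \\
  s_{t+1} &= \bigl\lVert f_{t+1} - \hat{f}_{t+1} \bigr\rVert_2
      && \text{(temporal surprise, assumed norm)}
\end{align}
```

Under this reading, a frame whose features continue the recent motion gets a small s and a frame that breaks from it gets a large one.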

📝 Abstract

While most frames in long-form video are redundant, the critical information resides in temporal surprises: moments where the actual visual features deviate from their predicted evolution. Inspired by the human brain's predictive coding, we introduce Swift Sampling, an elegant, training-free frame selection algorithm that automatically identifies high-information moments in a video. Specifically, we model a video as a differentiable trajectory in the visual latent space and compute the velocity and acceleration of its features. Then, we apply Taylor expansion to project the expected path of subsequent frames. Frames that diverge sharply from this predicted manifold are identified as temporally surprising frames and selected for sampling. Unlike prior training-free methods that rely on auxiliary networks or video-specific hyperparameter tuning, Swift Sampling is incredibly lightweight, adding only 0.02x additional computational cost over the baseline, which makes its overhead 30x cheaper than that of leading baselines. Across three long-video question answering benchmarks and 10 different downstream tasks, Swift Sampling outperforms uniform sampling and prior query-agnostic baselines. It is especially powerful for long videos with limited frame budgets, improving accuracy by up to +12.5 points.
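
The selection procedure described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `temporal_surprise` and `select_frames` are hypothetical, per-frame features are assumed to be precomputed by some visual encoder, and the finite-difference derivatives, L2 deviation, and top-k selection rule are assumptions that the abstract does not specify.

```python
import numpy as np


def temporal_surprise(features: np.ndarray) -> np.ndarray:
    """Score each frame by its deviation from a Taylor-style prediction.

    features: (T, D) array of per-frame embeddings (assumed precomputed).
    Returns a (T,) array; the first three frames score 0 because they lack
    the three-frame history this sketch needs.
    """
    feats = np.asarray(features, dtype=np.float64)
    num_frames = feats.shape[0]
    scores = np.zeros(num_frames)
    if num_frames < 4:
        return scores

    prev1 = feats[2:-1]  # f_{t-1}
    prev2 = feats[1:-2]  # f_{t-2}
    prev3 = feats[:-3]   # f_{t-3}

    # Assumed finite-difference estimates with a unit time step.
    velocity = prev1 - prev2
    acceleration = prev1 - 2.0 * prev2 + prev3

    # Second-order Taylor prediction of frame t from its history.
    predicted = prev1 + velocity + 0.5 * acceleration

    # Surprise = L2 distance between the actual and predicted features.
    scores[3:] = np.linalg.norm(feats[3:] - predicted, axis=1)
    return scores


def select_frames(features: np.ndarray, budget: int) -> np.ndarray:
    """Return the indices of the `budget` most surprising frames, in time order."""
    scores = temporal_surprise(features)
    k = min(budget, scores.shape[0])
    top = np.argsort(-scores, kind="stable")[:k]
    return np.sort(top)


if __name__ == "__main__":
    # Toy trajectory: steady linear motion with an abrupt shift at frame 10.
    toy = np.outer(np.arange(20), [1.0, 2.0])
    toy[10:, 0] += 5.0
    print(select_frames(toy, budget=3))
```

In this toy trajectory the features move linearly, so the prediction is exact until the shift. The scores become nonzero only at the shifted frame and the two frames after it, while the prediction history still straddles the discontinuity. A real pipeline would likely need some handling for ties and for the first frames of a video, which this sketch leaves at zero.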

Problem

Research questions and friction points this paper is trying to address.

temporal surprises

video frame sampling

long-form video

predictive deviation

information-rich moments

Innovation

Methods, ideas, or system contributions that make the work stand out.

Swift Sampling

temporal surprise

Taylor expansion

training-free frame selection

video latent trajectory
