🤖 AI Summary
Existing Video-LLMs rely on uniform frame sampling, which often overlooks salient "surprise" events critical to video narrative structure, leading to flawed semantic understanding. To address this, we propose SPIKE, the first framework to integrate Bayesian surprise into Video-LLM inference. SPIKE dynamically models conflicts between visual inputs and internal prior beliefs, enabling adaptive, query-agnostic frame selection. Its core components include surprise-driven belief updating, zero-shot prior calibration, GRPO-based reinforcement learning for optimization, and surprise-weighted sampling. Evaluated across five downstream video understanding tasks, SPIKE consistently outperforms uniform sampling baselines. Moreover, its detection of positive and negative surprise events exhibits strong agreement with human judgments (Pearson's r > 0.89), substantiating both its cognitive plausibility and practical efficacy.
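The "surprise-driven belief updating" above follows the classical notion of Bayesian surprise, commonly formalized (after Itti and Baldi) as the KL divergence from the prior belief distribution to the posterior after observing new evidence. A minimal sketch over a discrete belief distribution; the function name and discretization are illustrative and not the paper's actual implementation:

```python
import math

def bayesian_surprise(prior, posterior, eps=1e-12):
    """KL(posterior || prior) over a discrete belief distribution.

    A large value means the new visual evidence forced a big revision
    of the model's beliefs, i.e., a 'surprising' moment.
    """
    return sum(
        q * math.log((q + eps) / (p + eps))
        for q, p in zip(posterior, prior)
    )
```

For example, a frame that barely shifts the belief (0.5/0.5 → 0.6/0.4) yields a much smaller surprise score than one that strongly contradicts it (0.5/0.5 → 0.9/0.1).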
📄 Abstract
Real-world videos often show routine activities punctuated by memorable, surprising events. However, most Video-LLMs sample frames uniformly, likely missing the critical moments that define a video's narrative. We introduce SPIKE, an inference-time framework that quantifies Bayesian Surprise as the belief update triggered by new visual evidence in the video stream, identifying moments where incoming observations conflict with prior beliefs. SPIKE effectively localizes surprise in videos, correlating strongly with human judgments on positive (FunQA) and negative (Oops!) surprise benchmarks. Since the beliefs of zero-shot Video-LLMs are often suboptimal, we develop SPIKE-RL, which leverages GRPO to optimize belief hypotheses using a reward signal derived from the video caption. SPIKE and SPIKE-RL guide query-agnostic, surprise-weighted frame sampling, which allocates more frames to the interesting moments in a video. With this strategy, we achieve consistent gains over uniform sampling on five downstream benchmarks. By enabling Video-LLMs to track beliefs and register surprise, our work paves the way toward more robust models that can revise their understanding in response to new information.
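As one illustration of what surprise-weighted frame sampling could look like, the sketch below distributes a fixed frame budget across video segments in proportion to softmax-normalized surprise scores, so high-surprise segments receive more frames than a uniform sampler would give them. The function name, softmax weighting, and remainder handling are assumptions for illustration, not the paper's actual scheme:

```python
import math

def surprise_weighted_allocation(surprise_scores, num_frames, temperature=1.0):
    """Allocate `num_frames` across segments proportionally to
    softmax(surprise / temperature); an illustrative sketch."""
    weights = [math.exp(s / temperature) for s in surprise_scores]
    total = sum(weights)
    ideal = [num_frames * w / total for w in weights]  # fractional allocation
    alloc = [int(x) for x in ideal]
    # Hand leftover frames to the segments with the largest remainders.
    leftover = num_frames - sum(alloc)
    by_remainder = sorted(range(len(ideal)),
                          key=lambda i: ideal[i] - alloc[i], reverse=True)
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc
```

With uniform surprise scores this reduces to uniform sampling; a single high-surprise segment pulls frames away from routine stretches, matching the abstract's description of allocating more frames to interesting moments.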