🤖 AI Summary
Existing Video-LLMs rely on uniform frame sampling, which often overlooks salient "surprise" events critical to video narrative structure, leading to flawed semantic understanding. To address this, we propose SPIKE, the first framework to integrate Bayesian surprise into Video-LLM inference. SPIKE dynamically models conflicts between visual inputs and internal prior beliefs, enabling adaptive, query-agnostic frame selection. Its core components include surprise-driven belief updating, zero-shot prior calibration, GRPO-based reinforcement learning for optimization, and surprise-weighted sampling. Evaluated across five downstream video understanding tasks, SPIKE consistently outperforms uniform sampling baselines. Moreover, its detection of positive and negative surprise events exhibits strong agreement with human judgments (Pearson's r > 0.89), substantiating both its cognitive plausibility and practical efficacy.
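The "surprise-driven belief updating" above follows the classical notion of Bayesian surprise, commonly formalized (after Itti and Baldi) as the KL divergence from the prior belief distribution to the posterior after observing new evidence. A minimal sketch over a discrete belief distribution; the function name and discretization are illustrative and not the paper's actual implementation:

```python
import math

def bayesian_surprise(prior, posterior, eps=1e-12):
    """KL(posterior || prior) over a discrete belief distribution.

    A large value means the new visual evidence forced a big revision
    of the model's beliefs, i.e., a 'surprising' moment.
    """
    return sum(
        q * math.log((q + eps) / (p + eps))
        for q, p in zip(posterior, prior)
    )
```

For example, a frame that barely shifts the belief (0.5/0.5 → 0.6/0.4) yields a much smaller surprise score than one that strongly contradicts it (0.5/0.5 → 0.9/0.1).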
📄 Abstract
Real-world videos often show routine activities punctuated by memorable, surprising events. However, most Video-LLMs sample frames uniformly, likely missing the critical moments that define a video's narrative. We introduce SPIKE, an inference-time framework that quantifies Bayesian Surprise as the belief update triggered by new visual evidence in the video stream, identifying moments where incoming observations conflict with prior beliefs. SPIKE effectively localizes surprise in videos, correlating strongly with human judgments on positive (FunQA) and negative (Oops!) surprise benchmarks. Since the beliefs of zero-shot Video-LLMs are often suboptimal, we develop SPIKE-RL, which leverages GRPO to optimize belief hypotheses using a reward signal derived from the video caption. SPIKE and SPIKE-RL guide query-agnostic, surprise-weighted frame sampling, which allocates more frames to the interesting moments in a video. With this strategy, we achieve consistent gains over uniform sampling on five downstream benchmarks. By enabling Video-LLMs to track beliefs and register surprise, our work paves the way toward more robust models that can revise their understanding in response to new information.
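As one illustration of what surprise-weighted frame sampling could look like, the sketch below distributes a fixed frame budget across video segments in proportion to softmax-normalized surprise scores, so high-surprise segments receive more frames than a uniform sampler would give them. The function name, softmax weighting, and remainder handling are assumptions for illustration, not the paper's actual scheme:

```python
import math

def surprise_weighted_allocation(surprise_scores, num_frames, temperature=1.0):
    """Allocate `num_frames` across segments proportionally to
    softmax(surprise / temperature); an illustrative sketch."""
    weights = [math.exp(s / temperature) for s in surprise_scores]
    total = sum(weights)
    ideal = [num_frames * w / total for w in weights]  # fractional allocation
    alloc = [int(x) for x in ideal]
    # Hand leftover frames to the segments with the largest remainders.
    leftover = num_frames - sum(alloc)
    by_remainder = sorted(range(len(ideal)),
                          key=lambda i: ideal[i] - alloc[i], reverse=True)
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc
```

With uniform surprise scores this reduces to uniform sampling; a single high-surprise segment pulls frames away from routine stretches, matching the abstract's description of allocating more frames to interesting moments.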