MDP3: A Training-free Approach for List-wise Frame Selection in Video-LLMs

📅 2025-01-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the performance degradation that Video-LLMs suffer when keyframes are poorly selected from long videos, this paper proposes MDP3, a training-free, model-agnostic frame selection method. MDP3 accounts for query relevance, list-wise diversity, and sequentiality within one framework. It combines: (i) a conditional Gaussian kernel for query-conditioned frame similarity estimation; (ii) a determinantal point process (DPP) to model diversity among the selected frames; and (iii) a Markov decision process, solved with dynamic programming, that allocates selection sizes across video segments so that sequentiality is preserved. Theoretical analysis gives a $(1-1/e)$-approximation guarantee for the NP-hard list-wise selection problem, with pseudo-polynomial time complexity. Experiments on multiple video understanding benchmarks show that MDP3 significantly outperforms uniform sampling, query-frame matching, and other baselines, while remaining computationally efficient and generalizing across Video-LLMs.
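
As a rough illustration of how a DPP can trade query relevance against diversity, the sketch below runs greedy selection on a small kernel. It is not the paper's implementation: the kernel uses the standard quality-diversity decomposition (relevance times similarity times relevance) in place of the paper's conditional Gaussian kernel, the greedy loop is a generic log-determinant heuristic, and the names `gaussian_kernel`, `build_dpp_kernel`, `greedy_dpp_select`, and `bandwidth` are hypothetical.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth=1.0):
    """Pairwise Gaussian (RBF) kernel between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clamp tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def build_dpp_kernel(frames, query, bandwidth=1.0):
    """Assumed kernel: L[i, j] = r[i] * S[i, j] * r[j] (quality-diversity form)."""
    similarity = gaussian_kernel(frames, frames, bandwidth)
    relevance = gaussian_kernel(frames, query[None, :], bandwidth)[:, 0]
    return relevance[:, None] * similarity * relevance[None, :]


def greedy_dpp_select(kernel, k):
    """Greedily add the frame that maximizes log det of the selected submatrix."""
    selected = []
    remaining = list(range(kernel.shape[0]))
    for _ in range(k):
        best_idx, best_logdet = None, -np.inf
        for i in remaining:
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            # Skip singular candidates, e.g. exact duplicates of a chosen frame.
            if sign > 0 and logdet > best_logdet:
                best_idx, best_logdet = i, logdet
        if best_idx is None:
            break
        selected.append(best_idx)
        remaining.remove(best_idx)
    return sorted(selected)  # return indices in temporal order


# Toy embeddings: frames 0 and 1 are identical, frame 2 is different.
frames = np.array([[0.0, 0.0], [0.0, 0.0], [3.0, 0.0]])
query = np.array([0.0, 0.0])
kernel = build_dpp_kernel(frames, query)
picked = greedy_dpp_select(kernel, 2)
```

In this toy case the two identical frames are equally relevant to the query, yet the determinant of a submatrix containing both is zero, so the second pick goes to the less relevant but distinct frame. A pure query-frame matching rule would have picked the duplicate instead.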

📝 Abstract

Video large language models (Video-LLMs) have made significant progress in understanding videos. However, processing multiple frames produces lengthy visual token sequences, which raises two challenges: the limited context length cannot accommodate the entire video, and the inclusion of irrelevant frames hinders visual perception. Hence, effective frame selection is crucial. This paper emphasizes that frame selection should follow three key principles: query relevance, list-wise diversity, and sequentiality. Existing methods, such as uniform frame sampling and query-frame matching, do not capture all of these principles. Thus, we propose Markov decision determinantal point process with dynamic programming (MDP3) for frame selection, a training-free and model-agnostic method that can be seamlessly integrated into existing Video-LLMs. Our method first estimates frame similarities conditioned on the query using a conditional Gaussian kernel within the reproducing kernel Hilbert space (RKHS). We then apply the determinantal point process (DPP) to the similarity matrix to capture both query relevance and list-wise diversity. To incorporate sequentiality, we segment the video and apply DPP within each segment, conditioned on the preceding segment selection, modeled as a Markov decision process (MDP) for allocating selection sizes across segments. Theoretically, MDP3 provides a $(1 - 1/e)$-approximate solution to the NP-hard list-wise frame selection problem with pseudo-polynomial time complexity, demonstrating its efficiency. Empirically, MDP3 significantly outperforms existing methods, verifying its effectiveness and robustness.
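
The step of allocating selection sizes across segments can be pictured with a small dynamic program. The sketch below is a simplified illustration, not the authors' algorithm: it assumes each segment's score depends only on how many frames it receives, whereas the paper conditions each segment on the preceding segment's selection. The function `allocate_budget` and the table `segment_gains` are hypothetical names, and the scores are made-up numbers.

```python
def allocate_budget(segment_gains, budget):
    """Split a frame budget across segments to maximize the total score.

    segment_gains[t][m] is an assumed score for picking m frames in segment t.
    Returns (best_total, allocation) with sum(allocation) == budget.
    """
    neg_inf = float("-inf")
    # best[b] = best score using exactly b frames over the segments seen so far.
    best = [0.0] + [neg_inf] * budget
    choices = []  # choices[t][b] = frames given to segment t when b are used
    for gains in segment_gains:
        new_best = [neg_inf] * (budget + 1)
        choice = [0] * (budget + 1)
        for used in range(budget + 1):
            if best[used] == neg_inf:
                continue
            for m, gain in enumerate(gains):
                total = used + m
                if total > budget:
                    break
                candidate = best[used] + gain
                if candidate > new_best[total]:
                    new_best[total] = candidate
                    choice[total] = m
        best = new_best
        choices.append(choice)
    if best[budget] == neg_inf:
        raise ValueError("budget cannot be met with the given segments")
    # Backtrack from the last segment to recover the per-segment sizes.
    allocation = []
    b = budget
    for choice in reversed(choices):
        allocation.append(choice[b])
        b -= choice[b]
    allocation.reverse()
    return best[budget], allocation


# Three segments, each allowing 0, 1, or 2 frames; total budget of 3 frames.
gains = [[0, 5, 6], [0, 4, 9], [0, 1, 2]]
result = allocate_budget(gains, 3)
```

The running time grows with the number of segments times the budget times the largest per-segment size, so it depends on the numeric value of the budget rather than only on the input length. That is the sense in which this kind of table-filling is pseudo-polynomial.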

Problem

Research questions and friction points this paper is trying to address.

Video Information Retrieval

Diverse Frame Selection

Sequential Consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

MDP3

Determinantal Point Process

Video-LLMs Retrieval

🔎 Similar Papers

Frame-Voyager: Learning to Query Frames for Video Large Language Models