MDP3: A Training-free Approach for List-wise Frame Selection in Video-LLMs

📅 2025-01-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the performance degradation of Video-LLMs caused by suboptimal frame selection in long videos, this paper proposes MDP3, a training-free, model-agnostic frame selection method. MDP3 jointly optimizes for query relevance, list-wise diversity, and sequentiality under a unified framework. It combines: (i) a conditional Gaussian kernel for query-conditioned frame similarity estimation; (ii) determinantal point processes (DPPs) to capture query relevance and list-wise diversity; and (iii) a Markov decision process solved with dynamic programming to allocate selection sizes across video segments, which preserves sequentiality. Theoretical analysis gives a $(1-1/e)$-approximation guarantee for the NP-hard list-wise selection problem, with pseudo-polynomial time complexity. Experiments show that MDP3 significantly outperforms existing methods such as uniform frame sampling and query-frame matching, and that it can be integrated into existing Video-LLMs without retraining.

📝 Abstract
Video large language models (Video-LLMs) have made significant progress in understanding videos. However, processing multiple frames leads to lengthy visual token sequences, which poses two challenges: the limited context length cannot accommodate the entire video, and the inclusion of irrelevant frames hinders visual perception. Hence, effective frame selection is crucial. This paper emphasizes that frame selection should follow three key principles: query relevance, list-wise diversity, and sequentiality. Existing methods, such as uniform frame sampling and query-frame matching, do not capture all of these principles. Thus, we propose Markov decision determinantal point process with dynamic programming (MDP3) for frame selection, a training-free and model-agnostic method that can be seamlessly integrated into existing Video-LLMs. Our method first estimates frame similarities conditioned on the query using a conditional Gaussian kernel within the reproducing kernel Hilbert space (RKHS). We then apply the determinantal point process (DPP) to the similarity matrix to capture both query relevance and list-wise diversity. To incorporate sequentiality, we segment the video and apply DPP within each segment, conditioned on the preceding segment selection, modeled as a Markov decision process (MDP) for allocating selection sizes across segments. Theoretically, MDP3 provides a $(1 - 1/e)$-approximate solution to the NP-hard list-wise frame selection problem with pseudo-polynomial time complexity, demonstrating its efficiency. Empirically, MDP3 significantly outperforms existing methods, verifying its effectiveness and robustness.
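To make the relevance-plus-diversity idea concrete, here is a minimal Python sketch of DPP-style frame selection. It is not the paper's implementation. The kernel below is a common quality-times-similarity construction built from plain Gaussian kernels, standing in for the paper's conditional Gaussian kernel, whose exact form is not given in this abstract. The function names, the `bandwidth` parameter, and the toy embeddings are all assumptions made for illustration. The segment-wise MDP allocation is not covered.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between the rows of a and the rows of b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


def query_aware_kernel(frames, query, bandwidth=1.0):
    """Assumed stand-in kernel: L[i, j] = r_i * S[i, j] * r_j.

    r_i is the Gaussian similarity of frame i to the query (relevance),
    S[i, j] is the Gaussian similarity between frames i and j.
    """
    relevance = gaussian_kernel(frames, query[None, :], bandwidth)[:, 0]
    similarity = gaussian_kernel(frames, frames, bandwidth)
    return relevance[:, None] * similarity * relevance[None, :]


def greedy_dpp(kernel, k):
    """Greedily add the frame that maximizes log det of the selected submatrix."""
    selected = []
    for _ in range(k):
        best, best_score = None, -np.inf
        for i in range(kernel.shape[0]):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            # A non-positive determinant means the candidate adds no volume.
            score = logdet if sign > 0 else -np.inf
            if score > best_score:
                best, best_score = i, score
        if best is None:  # every remaining frame is redundant
            break
        selected.append(best)
    return sorted(selected)  # return indices in temporal order


# Toy embeddings: frames 0 and 1 are exact duplicates.
frames = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
query = np.array([1.0, 0.0])
picked = greedy_dpp(query_aware_kernel(frames, query), k=2)
```

In this toy example the two duplicate frames are never selected together, because the determinant of their 2x2 submatrix is zero. A frame that is relevant to the query but nearly identical to an already selected frame also scores poorly, which is the list-wise diversity effect the abstract describes.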
Problem

Research questions and friction points this paper is trying to address.

Video Information Retrieval
Diverse Frame Selection
Sequential Consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

MDP3
Determinantal Point Process
Video-LLMs Retrieval
Hui Sun
National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China
Shiyin Lu
Alibaba Group
Multimodal Large Language Models, Online Learning, Bandits
Huanyu Wang
National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China
Qing-Guo Chen
alibaba-inc
machine learning
Zhao Xu
Alibaba International Digital Commerce, Hangzhou, China
Weihua Luo
Alibaba
natural language processing, machine learning, artificial intelligence
Kaifu Zhang
Assistant Professor of Marketing, Carnegie Mellon University
Two-sided markets, Internet platforms, e-commerce
Ming Li
National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China