TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) struggle with long-video understanding because of constrained context length and prohibitive training costs, while existing training-free frame sampling methods often miss critical events and cannot be optimized. This paper proposes TSPO, a trainable temporal sampling framework grounded in reinforcement learning. Its core contributions are threefold: (1) an event-aware temporal agent that jointly models question answering and keyframe selection; (2) an end-to-end group relative optimization paradigm that sidesteps the non-differentiability of sparse frame sampling; and (3) a rule-based sparse reward mechanism coupled with a “needle-in-a-haystack” video data construction strategy that emphasizes rare but semantically pivotal events. Evaluated on multiple long-video question-answering benchmarks, TSPO achieves state-of-the-art performance and generalizes across diverse video-MLLM architectures.

📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated significant progress in vision-language tasks, yet they still face challenges when processing long-duration video inputs. The limitation arises from MLLMs' context limit and training costs, necessitating sparse frame sampling before feeding videos into MLLMs. Existing video MLLMs adopt training-free uniform sampling or keyframe search, which may miss critical events or be constrained by the pre-trained models' event understanding capabilities. Meanwhile, building a training-based method remains challenging due to the unsupervised and non-differentiable nature of sparse frame sampling. To address these problems, we propose Temporal Sampling Policy Optimization (TSPO), advancing MLLMs' long-form video-language understanding via reinforcement learning. Specifically, we first propose a trainable event-aware temporal agent, which captures event-query correlation for performing probabilistic keyframe selection. Then, we propose the TSPO reinforcement learning paradigm, which models keyframe selection and language generation as a joint decision-making process, enabling end-to-end group relative optimization with efficient rule-based rewards. Furthermore, for TSPO's training, we propose a long-video training data construction pipeline with comprehensive temporal data and video Needle-in-a-Haystack data. Finally, we incorporate rule-based answering-accuracy and temporal-locating reward mechanisms to optimize the temporal sampling policy. Comprehensive experiments show that TSPO achieves state-of-the-art performance across multiple long-video understanding benchmarks and transfers across different cutting-edge Video-MLLMs.
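The "probabilistic keyframe selection" the abstract describes can be illustrated with a minimal sketch: per-frame relevance scores (e.g., from event-query similarity) define a softmax policy, and frame subsets are sampled rather than taken by argmax, so a reinforcement-learning objective can explore alternatives. The function name `sample_keyframes` and the score values are illustrative assumptions, not the paper's API.

```python
import numpy as np

def sample_keyframes(relevance_scores, k, temperature=1.0, rng=None):
    """Sample k distinct frame indices from a softmax policy over scores.

    Sketch only: assumes `relevance_scores` come from some event-query
    correlation model, which the paper's temporal agent would provide.
    """
    rng = np.random.default_rng(rng)
    logits = np.asarray(relevance_scores, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    # Stochastic sampling (not top-k argmax) keeps selection exploratory,
    # which is what makes policy-gradient-style optimization possible.
    idx = rng.choice(len(probs), size=k, replace=False, p=probs)
    return np.sort(idx)  # keep temporal order for the downstream MLLM

# Toy example: 6 frames, higher score = more query-relevant.
frames = sample_keyframes([0.1, 0.9, 0.2, 0.8, 0.05, 0.7], k=3, rng=0)
```

Lowering `temperature` concentrates sampling on the highest-scored frames; raising it flattens the policy toward uniform sampling.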
Problem

Research questions and friction points this paper is trying to address.

Optimizing sparse frame sampling for long videos in MLLMs
Addressing unsupervised, non-differentiable video frame selection challenges
Improving event-aware keyframe selection via reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning optimizes video frame sampling
Event-aware agent selects keyframes probabilistically
Rule-based rewards enhance temporal sampling policy
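The last two points, rule-based rewards and group relative optimization, can be sketched as follows. This is a hedged illustration, not the paper's implementation: the reward here combines answer correctness with a hypothetical bonus for sampling frames inside an annotated key segment, and advantages are normalized within a group of rollouts in GRPO style. All names (`rule_based_reward`, `group_relative_advantages`, the bonus value) are assumptions.

```python
import numpy as np

def rule_based_reward(answer_correct, frames, key_segment, bonus=0.5):
    """+1 for a correct answer, plus a temporal-locating bonus if any
    sampled frame falls inside the annotated key segment (assumed form)."""
    start, end = key_segment
    hit = any(start <= f <= end for f in frames)
    return float(answer_correct) + (bonus if hit else 0.0)

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts (GRPO-style):
    each rollout is scored relative to its group's mean and std."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Toy group of 4 rollouts for the same question; key segment = frames 35-60.
rollouts = [
    (True,  [12, 40, 77]),  # correct answer, frames hit the segment
    (True,  [5, 9, 30]),    # correct answer, frames miss the segment
    (False, [12, 41, 90]),  # wrong answer, frames hit the segment
    (False, [1, 2, 3]),     # wrong answer, frames miss the segment
]
rewards = [rule_based_reward(ok, fr, key_segment=(35, 60)) for ok, fr in rollouts]
advantages = group_relative_advantages(rewards)  # zero-mean within the group
```

Because advantages are relative within the group, no learned value network is needed; cheap rule-based rewards suffice as the training signal.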
👥 Authors
Canhui Tang · Xi'an Jiaotong University
Zifan Han · Xi'an Jiaotong University; Institute of Artificial Intelligence (TeleAI), China Telecom
Hongbo Sun · Institute of Artificial Intelligence (TeleAI), China Telecom
Sanping Zhou · Xi'an Jiaotong University
Xuchong Zhang · Xi'an Jiaotong University
Xin Wei · Schmidt AI in Science Postdoc, University of Michigan
Ye Yuan · Institute of Artificial Intelligence (TeleAI), China Telecom
Huayu Zhang · Senior Engineer, Huawei Technologies Co., Ltd
Jinglin Xu · University of Science and Technology Beijing
Hao Sun · Institute of Artificial Intelligence (TeleAI), China Telecom