Unhackable Temporal Rewarding for Scalable Video MLLMs

📅 2025-02-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Video multimodal large language models (MLLMs) exhibit an "anti-scaling law": increased model size and data volume degrade performance. The root cause is "temporal hacking", an overreliance on salient keyframes at the expense of temporal coherence. Method: We formally characterize this phenomenon from a reinforcement learning perspective, introduce Temporal Perplexity (TPL) as an interpretable metric of temporal modeling quality, and propose the Unhackable Temporal Rewarding (UTR) framework to align agent objectives with genuine temporal understanding. The approach integrates temporal attention analysis, TPL-based quantification, and UTR-guided gradient correction into standard training pipelines. Contribution/Results: The method effectively suppresses temporal hacking, yielding consistent performance gains of 12.6–23.4% across multiple video understanding benchmarks. TPL achieves a 0.89 correlation with human temporal annotations, validating it as a principled indicator of temporal modeling fidelity.
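The paper does not spell out the exact TPL formula in this summary, but its core diagnostic, whether attention collapses onto a few salient keyframes, can be illustrated with a hypothetical proxy: the normalized entropy of a model's attention distribution over frames. The function name and the entropy-based score below are illustrative assumptions, not the paper's actual TPL definition.

```python
import numpy as np

def frame_attention_entropy(attn_weights):
    """Normalized entropy of a model's attention over video frames.

    Illustrative proxy (not the paper's TPL): values near 0 mean
    attention fixates on a few keyframes (the "temporal hacking"
    shortcut); values near 1 mean attention spans the whole clip.
    """
    p = np.asarray(attn_weights, dtype=float)
    p = p / p.sum()                        # normalize to a distribution
    ent = -np.sum(p * np.log(p + 1e-12))   # Shannon entropy
    return ent / np.log(len(p))            # scale by max entropy, log(n)

# A model fixating on one keyframe vs. one attending evenly:
hacked = frame_attention_entropy([0.94, 0.02, 0.02, 0.02])   # ~0.21
healthy = frame_attention_entropy([0.25, 0.25, 0.25, 0.25])  # ~1.0
```

A score like this makes the "frame activation patterns" claim in the abstract concrete: a hacked model's attention distribution is low-entropy, a temporally faithful one is high-entropy.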

📝 Abstract
In the pursuit of superior video-processing MLLMs, we have encountered a perplexing paradox: the "anti-scaling law", where more data and larger models lead to worse performance. This study unmasks the culprit: "temporal hacking", a phenomenon where models shortcut by fixating on select frames, missing the full video narrative. In this work, we systematically establish a comprehensive theory of temporal hacking, defining it from a reinforcement learning perspective, introducing the Temporal Perplexity (TPL) score to assess this misalignment, and proposing the Unhackable Temporal Rewarding (UTR) framework to mitigate temporal hacking. Both theoretically and empirically, TPL proves to be a reliable indicator of temporal modeling quality, correlating strongly with frame activation patterns. Extensive experiments reveal that UTR not only counters temporal hacking but significantly elevates video comprehension capabilities. This work not only advances video-AI systems but also illuminates the critical importance of aligning proxy rewards with true objectives in MLLM development.
Problem

Research questions and friction points this paper is trying to address.

Addresses anti-scaling law in video MLLMs
Identifies and mitigates temporal hacking
Proposes Unhackable Temporal Rewarding framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unhackable Temporal Rewarding framework
Temporal Perplexity score measurement
Reinforcement learning-based temporal alignment
En Yu
Huazhong University of Science and Technology
Kangheng Lin
Beijing University of Posts and Telecommunications
Liang Zhao
StepFun
Yana Wei
Johns Hopkins University
Zining Zhu
Stevens Institute of Technology
Natural Language Processing · Explainable AI
Haoran Wei
StepFun
Jianjian Sun
Researcher, StepFun
LLM · Multi-modal
Zheng Ge
Senior Researcher, StepFun
Multimodal Models · Perception and Reasoning
Xiangyu Zhang
StepFun
Jingyu Wang
Beijing University of Posts and Telecommunications
Wenbing Tao
Professor, School of Automation, Huazhong University of Science and Technology
Image Processing · Computer Vision · Pattern Recognition