VideoPerceiver: Enhancing Fine-Grained Temporal Perception in Video Multimodal Large Language Models

📅 2025-11-24
📈 Citations: 0
Influential citations: 0
📄 PDF
🤖 AI Summary
To address the challenge that video multimodal large language models (VMLLMs) struggle to capture brief actions and rare transient events in long videos, this paper proposes an approach for enhancing fine-grained temporal perception. The method introduces three key components: (1) synthesizing "key-information-missing" video samples, in which key frames identified from caption keywords are replaced by adjacent frames, to emphasize transient dynamics; (2) a two-stage training framework that combines supervised fine-tuning with reinforcement learning on these frame-replacement negative samples; and (3) an auxiliary contrastive loss that explicitly aligns intermediate visual representations with semantic keywords, together with a relative reward mechanism that requires responses from complete videos to outperform those from degraded inputs. Experiments show significant improvements over state-of-the-art methods on fine-grained action understanding and rare-event description tasks, while maintaining competitive performance on standard video understanding benchmarks.
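The key-frame replacement described in (1) can be pictured with a short sketch. This is a minimal illustration, not the paper's released code: it assumes the clip is already decoded into a list of frame arrays and that the indices of key frames (found by matching event/action keywords in the caption) are given; the function name `drop_key_frames` and the nearest-neighbour replacement rule are illustrative assumptions.

```python
# Minimal sketch: build a "key-information-missing" variant of a clip by
# overwriting each identified key frame with its nearest non-key frame.
# `key_indices` would come from matching caption keywords to frames
# (keyword extraction and frame matching are not shown here).
from typing import List, Sequence

import numpy as np


def drop_key_frames(frames: List[np.ndarray], key_indices: Sequence[int]) -> List[np.ndarray]:
    """Return a copy of `frames` in which every key frame is replaced
    by the temporally closest frame that is not itself a key frame."""
    key_set = set(key_indices)
    non_key = [i for i in range(len(frames)) if i not in key_set]
    if not non_key:  # degenerate case: every frame is "key", return unchanged
        return list(frames)

    degraded = list(frames)
    for k in key_set:
        nearest = min(non_key, key=lambda i: abs(i - k))  # nearest surviving frame
        degraded[k] = frames[nearest].copy()
    return degraded


if __name__ == "__main__":
    # Toy clip of 8 constant-valued "frames"; frames 3 and 4 are marked as key.
    clip = [np.full((2, 2, 3), i, dtype=np.uint8) for i in range(8)]
    degraded = drop_key_frames(clip, key_indices=[3, 4])
    print([int(f[0, 0, 0]) for f in degraded])  # [0, 1, 2, 2, 5, 5, 6, 7]
```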

📝 Abstract
We propose VideoPerceiver, a novel video multimodal large language model (VMLLM) that enhances fine-grained perception in video understanding, addressing VMLLMs' limited ability to reason about brief actions in short clips or rare transient events in long videos. VideoPerceiver adopts a two-stage training framework. During supervised fine-tuning (SFT), we construct "key-information-missing" videos by extracting event-action keywords from captions, identifying corresponding key frames, and replacing them with adjacent frames. We jointly encode original and modified video tokens with text tokens, aligning intermediate visual representations with keywords via an auxiliary contrastive loss to enhance sensitivity to fine-grained motion cues. In reinforcement learning (RL), both video variants are fed into the model to generate descriptions, and a novel relative reward ensures responses from complete videos outperform those from degraded inputs, explicitly training the model to recover temporally precise action details. We also curate a dataset of 80,000 videos with fine-grained actions and transient events. Experiments show VideoPerceiver substantially outperforms state-of-the-art VMLLMs on fine-grained action understanding and rare event captioning benchmarks, while maintaining strong performance on standard tasks. By prioritizing task-relevant visual features, our work redefines video-language model training for fine-grained perception.
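The auxiliary contrastive loss mentioned in the abstract, which aligns intermediate visual representations with event/action keywords during SFT, can be sketched as a standard InfoNCE-style objective. The snippet below assumes PyTorch and that pooled visual features and keyword embeddings have already been projected into a shared space; the symmetric cross-entropy form and the temperature value are generic choices, not necessarily the paper's exact formulation.

```python
# Illustrative InfoNCE-style alignment between pooled visual features and
# keyword embeddings (both assumed to live in a shared d-dimensional space).
# A generic contrastive loss, not necessarily the paper's exact objective.
import torch
import torch.nn.functional as F


def keyword_alignment_loss(visual: torch.Tensor,
                           keyword: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """visual, keyword: (batch, d) paired features; row i of `visual` should be
    most similar to row i of `keyword` and dissimilar to all other rows."""
    v = F.normalize(visual, dim=-1)
    k = F.normalize(keyword, dim=-1)
    logits = v @ k.t() / temperature                    # (batch, batch) similarities
    targets = torch.arange(v.size(0), device=v.device)  # matching pairs on the diagonal
    # Symmetric: align video -> keyword and keyword -> video.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    torch.manual_seed(0)
    vis = torch.randn(4, 256)  # e.g. pooled intermediate visual tokens
    kw = torch.randn(4, 256)   # e.g. encoded event/action keywords
    print(float(keyword_alignment_loss(vis, kw)))
```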
Problem

Research questions and friction points this paper is trying to address.

Addressing VMLLMs' limited reasoning about brief actions in short clips
Improving detection of rare transient events in long video sequences
Enhancing sensitivity to fine-grained motion cues in video understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage training with key-frame replacement
Auxiliary contrastive loss for motion sensitivity
Relative reward for precise action recovery (see the sketch after this list)
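The relative reward listed above can be sketched as a simple margin-based comparison: during RL, the description generated from the complete video should score higher than the one generated from the degraded (key-frame-replaced) video. In the sketch below, `score_fn` and the margin are placeholder assumptions, not the paper's actual reward definition.

```python
# Rough sketch of a relative reward: the response from the complete video
# should beat the response from the degraded (key-frame-replaced) video.
# `score_fn` (e.g. similarity of a response to the reference caption) and
# the margin are placeholders, not the paper's definition.
from typing import Callable


def relative_reward(resp_full: str,
                    resp_degraded: str,
                    reference: str,
                    score_fn: Callable[[str, str], float],
                    margin: float = 0.05) -> float:
    """Positive reward only when the complete-video response outscores the
    degraded-video response by at least `margin`."""
    gap = score_fn(resp_full, reference) - score_fn(resp_degraded, reference)
    return max(0.0, gap - margin)


if __name__ == "__main__":
    # Toy scorer: fraction of reference words recovered in the response.
    def toy_score(resp: str, ref: str) -> float:
        ref_words = set(ref.lower().split())
        return len(ref_words & set(resp.lower().split())) / max(len(ref_words), 1)

    reference = "the man quickly flips the pancake"
    full = "a man quickly flips the pancake in a pan"
    degraded = "a man stands near a stove holding a pan"
    print(relative_reward(full, degraded, reference, toy_score))  # 0.75
```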
Authors
Fufangchen Zhao
State Key Laboratory of Networking and Switching Technology, BUPT
Liao Zhang
Independent Researcher
Daiqi Shi
Independent Researcher
Yuanjun Gao
Independent Researcher
Chen Ye
Independent Researcher
Yang Cai
Professor of Computer Science and Economics, Yale University
Theoretical Computer Science, Algorithmic Game Theory, Mechanism Design, Learning
Jian Gao
State Key Laboratory of Networking and Switching Technology, BUPT
Danfeng Yan
State Key Laboratory of Networking and Switching Technology, BUPT