MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding

๐Ÿ“… 2026-02-26
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work addresses the high computational cost and redundant-frame interference of long-form video understanding by proposing MSJoE, a framework that jointly optimizes a multimodal large language model (MLLM) with a lightweight keyframe sampler. MSJoE employs question-driven, multi-view query reasoning to generate a query-frame similarity matrix with a frozen CLIP encoder, from which the sampler selects the most informative keyframes for downstream processing. A reinforcement learning-based co-training mechanism enables synergistic optimization of both components. Evaluated on multiple long-video question-answering benchmarks, MSJoE achieves an average accuracy improvement of 8.0% over the base MLLM and surpasses the strongest baseline by 1.1%. Additionally, the authors contribute a new dataset comprising 2.8K videos and 7K question-answer pairs to support future research in this domain.

๐Ÿ“ Abstract
Efficiently understanding long-form videos remains a fundamental challenge for multimodal large language models (MLLMs). In this paper, we present MLLM-Sampler Joint Evolution (MSJoE), a novel framework that jointly evolves the MLLM and a lightweight key-frame sampler for efficient long-form video understanding. MSJoE builds upon the key assumption that only a small subset of key-frames is truly informative for answering each question about a video. Specifically, MSJoE first reasons out several queries that describe diverse visual perspectives relevant to the question. These queries then interact with a frozen CLIP model to produce a query-frame similarity matrix. Finally, a lightweight sampler predicts key-frame sampling weights from this matrix, selecting a compact set of informative frames that are fed into the MLLM for answer generation. Both the MLLM and the sampler are jointly optimized through reinforcement learning, enabling co-adaptation of query reasoning, frame sampling, and key-frame understanding. A new long-video QA dataset containing 2.8K videos with 7K question-answer pairs is collected to support training. Extensive experiments on VideoMME, LongVideoBench, LVBench, and MLVU show that MSJoE achieves an 8.0% accuracy gain over the base MLLM and 1.1% higher accuracy than the strongest baseline.
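The core selection step described above (queries × frames → similarity matrix → compact keyframe set) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes pre-computed query and frame embeddings (in the paper these would come from a frozen CLIP text/image encoder), and it replaces the learned lightweight sampler with a simple max-pool-over-queries plus top-k rule.

```python
import numpy as np

def query_frame_similarity(query_embs, frame_embs):
    """Cosine similarity between L2-normalized query and frame embeddings.

    query_embs: (Q, D) embeddings of the reasoned queries (assumed to come
    from a frozen CLIP text encoder); frame_embs: (T, D) frame embeddings
    (assumed CLIP image encoder outputs). Returns a (Q, T) matrix.
    """
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    return q @ f.T

def select_keyframes(sim, k):
    """Toy stand-in for the learned lightweight sampler: pool the (Q, T)
    similarity matrix over queries into per-frame weights, then keep the
    k highest-weight frames, returned in temporal order."""
    weights = sim.max(axis=0)       # each frame scored by its best-matching query
    top = np.argsort(weights)[-k:]  # indices of the k largest weights
    return np.sort(top)             # restore temporal order for the MLLM

# Toy example: 3 reasoned queries, 8 video frames, 4-dim embeddings.
rng = np.random.default_rng(0)
sim = query_frame_similarity(rng.normal(size=(3, 4)), rng.normal(size=(8, 4)))
keyframes = select_keyframes(sim, 3)
print(keyframes)  # indices of the 3 selected frames
```

In MSJoE the per-frame sampling weights are predicted by a trained sampler rather than a fixed max-pool, and both the sampler and the MLLM receive reinforcement-learning signal from answer quality; the sketch only shows the data flow.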
Problem

Research questions and friction points this paper is trying to address.

long-form video understanding
multimodal large language models
efficient video processing
key-frame selection
video question answering
Innovation

Methods, ideas, or system contributions that make the work stand out.

joint evolution
key-frame sampling
multimodal large language model
long-form video understanding
reinforcement learning
๐Ÿ”Ž Similar Papers
2024-02-20International Conference on Machine LearningCitations: 30