Threading Keyframe with Narratives: MLLMs as Strong Long Video Comprehenders

📅 2025-05-30
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the fundamental tension between the excessive visual tokens of long videos and the limited context length of language models, this paper proposes Nar-KFC, a keyframe-narrative interleaved compression module. It jointly optimizes keyframe selection for both query-relevance and diversity, and explicitly models the temporal continuity of skipped frames via temporally aligned textual narratives. Methodologically, keyframe selection is formulated as an integer quadratic program and solved efficiently with a greedy algorithm; compact multimodal representations are obtained by combining off-the-shelf image captioners with multimodal large language models (MLLMs). Evaluated on multiple long-video benchmarks, Nar-KFC significantly boosts the performance of mainstream MLLMs, achieving both high efficiency and strong representational capacity. The code is publicly available.

๐Ÿ“ Abstract
Employing Multimodal Large Language Models (MLLMs) for long video understanding remains a challenging problem due to the dilemma between the substantial number of video frames (i.e., visual tokens) and the limited context length of language models. Traditional uniform sampling often leads to the selection of irrelevant content, while post-training MLLMs on thousands of frames imposes a substantial computational burden. In this paper, we propose threading keyframes with narratives (Nar-KFC), a plug-and-play module to facilitate effective and efficient long video perception. Nar-KFC involves two collaborative steps. First, we formulate the keyframe selection process as an integer quadratic programming problem, jointly optimizing query-relevance and frame-diversity. To avoid its computational complexity, a customized greedy search strategy is designed as an efficient alternative. Second, to mitigate the temporal discontinuity caused by sparse keyframe sampling, we further introduce interleaved textual narratives generated from non-keyframes using off-the-shelf captioners. These narratives are inserted between keyframes in their true temporal order, forming a coherent and compact representation. Nar-KFC thus serves as a temporal- and content-aware compression strategy in which the visual and textual modalities complement each other. Experimental results on multiple long-video benchmarks demonstrate that Nar-KFC significantly improves the performance of popular MLLMs. Code will be made publicly available.
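The first step described above, greedy keyframe selection trading off query-relevance against diversity, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the objective weights (`alpha`), the use of cosine similarity for diversity, and the seeding choice are all assumptions.

```python
import numpy as np

def greedy_keyframe_selection(relevance, features, k, alpha=0.5):
    """Greedily pick k frames balancing query-relevance and diversity.

    relevance: (N,) query-frame similarity scores.
    features:  (N, D) L2-normalized frame embeddings.
    alpha and this exact marginal-gain objective are illustrative
    assumptions, not the paper's exact formulation.
    """
    n = len(relevance)
    selected = [int(np.argmax(relevance))]  # seed with the most relevant frame
    while len(selected) < k:
        best, best_gain = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            # diversity term: 1 - max cosine similarity to chosen frames
            sim = features[selected] @ features[i]
            gain = alpha * relevance[i] + (1 - alpha) * (1.0 - sim.max())
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
    return sorted(selected)  # keep temporal order for downstream threading
```

Each iteration costs O(N·k) similarity lookups, so the full selection is O(N·k²), avoiding the combinatorial cost of solving the integer quadratic program exactly.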
Problem

Research questions and friction points this paper is trying to address.

Addressing long video understanding with MLLMs
Optimizing keyframe selection for relevance and diversity
Mitigating temporal discontinuity with textual narratives
Innovation

Methods, ideas, or system contributions that make the work stand out.

Keyframe selection via integer quadratic programming
Greedy search for efficient keyframe selection
Interleaved textual narratives from non-keyframes
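The threading step behind the last bullet can be sketched as a simple merge by temporal index: keyframes stay as visual slots, while captions of skipped frames are inserted between them. The function name, the `("image", idx)` / `("text", caption)` tuple encoding, and the caption source are illustrative assumptions; in practice the captions would come from an off-the-shelf captioner.

```python
def thread_keyframes_with_narratives(keyframe_ids, captions):
    """Interleave keyframes with non-keyframe captions in temporal order.

    keyframe_ids: sorted frame indices chosen as keyframes.
    captions: dict mapping non-keyframe index -> narrative string.
    Returns a list of ("image", idx) and ("text", caption) entries,
    ordered by true frame index (illustrative sketch).
    """
    key = set(keyframe_ids)
    all_ids = sorted(key | set(captions))
    sequence = []
    for idx in all_ids:
        if idx in key:
            sequence.append(("image", idx))           # visual token slot
        else:
            sequence.append(("text", captions[idx]))  # narrative filler
    return sequence
```

For example, with keyframes at indices 0 and 4 and a caption for frame 2, the output alternates image slots and narrative text, preserving temporal continuity across the gap.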
🔎 Similar Papers
No similar papers found.