🤖 AI Summary
Existing MLLM-based video moment localization methods are constrained by the text-generation paradigm, which lacks frame-level gradients and thus suffers from insufficient fine-grained localization accuracy. This paper introduces, for the first time, a frame-level binary segmentation paradigm for MLLMs: the model's output is formulated directly as a per-frame "0/1" sequence, unifying language understanding and frame-level localization in an end-to-end manner. By jointly optimizing a segmentation loss and the causal language modeling loss, the approach overcomes a fundamental limitation of generative modeling: the absence of explicit frame-level supervision. The method integrates frame-sequence prompting, logit mapping, and beam-search decoding. On QVHighlights, it achieves 56.74% HIT@1 for highlight detection and 35.28 mAP for moment retrieval, while sampling only 25 frames (approximately half that of current SOTA methods) and exhibiting improved training stability.
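A minimal sketch of how such a joint objective could be implemented, assuming a PyTorch-style MLLM whose answer tokens are the per-frame "0"/"1" characters. The function name, argument names, and the choice of binary cross-entropy as the segmentation term are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def combined_loss(lm_logits, answer_positions, zero_id, one_id,
                  frame_labels, lm_loss, seg_weight=1.0):
    """Sketch of the joint objective: a per-frame segmentation loss on the
    "0"/"1" token logits added to the usual causal LM loss.

    lm_logits:        (seq_len, vocab) logits from the MLLM
    answer_positions: indices of the answer tokens, one per sampled frame
    zero_id, one_id:  vocabulary ids of the "0" and "1" characters
    frame_labels:     (num_frames,) ground-truth 0/1 foreground mask
    lm_loss:          the standard next-token cross-entropy, computed elsewhere
    """
    # Restrict each answer position to the two candidate tokens and take the
    # softmax probability of "1" as that frame's foreground probability.
    pair_logits = lm_logits[answer_positions][:, [zero_id, one_id]]  # (num_frames, 2)
    fg_prob = torch.softmax(pair_logits, dim=-1)[:, 1]               # (num_frames,)

    # Binary cross-entropy as the frame-level segmentation term; a Dice or
    # IoU-style loss could be added on fg_prob in the same way.
    seg_loss = F.binary_cross_entropy(fg_prob, frame_labels.float())

    return lm_loss + seg_weight * seg_loss
```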
📝 Abstract
Detection of video moments and highlights from natural-language queries has been unified by transformer-based methods. Other works use generative Multimodal LLMs (MLLMs) to predict moments and/or highlights as text timestamps, exploiting their reasoning capability. While effective, text-based generation cannot provide direct gradients for frame-level predictions because the model only emits language tokens. While recent Reinforcement Learning (RL) methods attempt to address this issue, we propose a novel approach that applies segmentation objectives directly to the LLM's output tokens. The LLM is fed a fixed number of frames alongside a prompt that requires it to output a contiguous sequence of "0" and "1" characters, one character per frame. The "0"/"1" characters benefit from the LLM's inherent language capability while also acting as background and foreground probabilities, respectively. Training applies segmentation losses to these probabilities alongside the standard causal LM loss. At inference, beam search generates the sequence and its logits, which serve as moments and saliency scores, respectively. Despite sampling only 25 frames -- less than half of comparable methods -- our method achieves strong highlight detection (56.74 HIT@1) on QVHighlights. Additionally, our efficient method scores above the baseline (35.28 mAP) for moment retrieval. Empirically, the segmentation losses provide a stable complementary learning signal even when the causal LM loss plateaus.
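As a rough illustration of the decoding step described above, the sketch below converts a generated per-frame "0"/"1" string into moment spans and reuses the per-frame "1" probabilities (taken from the beam-search logits) as saliency scores. The helper name, the clip_len parameter, and the span convention are assumptions for illustration, not the paper's released code:

```python
def decode_moments(frame_chars, frame_scores, clip_len=2.0):
    """Turn the generated per-frame "0"/"1" string into (start, end) moments
    and keep each frame's foreground probability as its saliency score.

    frame_chars:  e.g. "0011110010" -- one character per sampled frame
    frame_scores: per-frame probability of the "1" token (from beam-search logits)
    clip_len:     seconds covered by each sampled frame (assumed uniform sampling)
    """
    moments, start = [], None
    for i, c in enumerate(frame_chars):
        if c == "1" and start is None:
            start = i                       # open a foreground span
        elif c == "0" and start is not None:
            moments.append((start * clip_len, i * clip_len))  # close the span
            start = None
    if start is not None:                   # span runs to the end of the video
        moments.append((start * clip_len, len(frame_chars) * clip_len))
    saliency = list(frame_scores)           # one score per frame, for HIT@1 / mAP
    return moments, saliency
```

For example, the string "0011110010" with a 2-second clip length would yield the moments (4.0, 12.0) and (16.0, 18.0).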