Moment and Highlight Detection via MLLM Frame Segmentation

📅 2025-12-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing MLLM-based video moment localization methods are constrained by the text-generation paradigm, which lacks frame-level gradients and thus suffers from insufficient fine-grained localization accuracy. This paper introduces, for the first time, a frame-level binary segmentation paradigm for MLLMs: the model's output is formulated directly as a per-frame "0"/"1" sequence, unifying language understanding and frame-level localization in an end-to-end manner. By jointly optimizing a segmentation loss and a causal language modeling loss, the approach addresses the fundamental limitation of generative modeling: its absence of explicit frame-level supervision. The method integrates frame-sequence prompting, logit mapping, and beam-search decoding. On QVHighlights, it achieves 56.74 HIT@1 for highlight detection and 35.28 mAP for moment retrieval while sampling only 25 frames (roughly half that of current SOTA methods), and it exhibits improved training stability.
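As a rough illustration of the joint objective described above, the sketch below maps each frame's "0"/"1" token logits to a foreground probability and applies a binary segmentation loss on top. The two-way softmax, the BCE choice, and all names are illustrative assumptions, not the paper's exact formulation.

```python
import math

def frame_probs(logits_0, logits_1):
    """Map per-frame logits of the '0' and '1' tokens to foreground
    probabilities via a two-way softmax (hypothetical token pair)."""
    probs = []
    for l0, l1 in zip(logits_0, logits_1):
        e0, e1 = math.exp(l0), math.exp(l1)
        probs.append(e1 / (e0 + e1))  # P(frame is foreground)
    return probs

def bce_seg_loss(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over per-frame foreground probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp for numerical stability
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(probs)
```

In training, this segmentation term would simply be added to the usual causal LM loss, e.g. `total_loss = lm_loss + lam * seg_loss` with some hypothetical weight `lam`.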

📝 Abstract
Detecting video moments and highlights from natural-language queries has been unified by transformer-based methods. Other works use a generative Multimodal LLM (MLLM) to predict moments and/or highlights as text timestamps, exploiting its reasoning capability. While effective, text-based generation cannot provide direct gradients for frame-level predictions because the model only emits language tokens. Although recent Reinforcement Learning (RL) methods attempt to address this issue, we propose a novel approach that applies segmentation objectives directly to the LLM's output tokens. The LLM is fed a fixed number of frames alongside a prompt that constrains it to output a contiguous sequence of "0" and/or "1" characters, one character per frame. The "0"/"1" characters benefit from the LLM's inherent language capability while also acting as background and foreground probabilities, respectively. Training employs segmentation losses on these probabilities alongside a standard causal LM loss. At inference, beam search generates the sequence and its logits, which act as moments and saliency scores, respectively. Despite sampling only 25 frames, less than half that of comparable methods, our method achieves strong highlight detection (56.74 HIT@1) on QVHighlights. Additionally, our efficient method scores above the baseline (35.28 mAP) for moment retrieval. Empirically, the segmentation losses provide a stable complementary learning signal even when the causal LM loss plateaus.
Problem

Research questions and friction points this paper is trying to address.

Detects video moments and highlights from natural-language queries.
Uses MLLM for frame-level segmentation via token probabilities.
Improves efficiency with fewer frames and strong performance.
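The frame-sequence prompting idea above can be illustrated with a hypothetical prompt builder that enforces one "0"/"1" character per frame. The wording and the default of 25 frames are illustrative guesses, not the paper's actual prompt.

```python
def build_prompt(query, num_frames=25):
    """Build an illustrative prompt constraining the MLLM to emit
    exactly one '0'/'1' character per sampled frame."""
    return (
        f'Given {num_frames} video frames and the query: "{query}", '
        f"output exactly {num_frames} characters, one per frame: "
        "'1' if the frame matches the query, otherwise '0'."
    )
```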
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses segmentation loss on LLM output tokens.
Encodes frames as binary character sequences.
Combines causal LM and segmentation objectives.
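Once the model decodes a per-frame "0"/"1" string, moments can be recovered as runs of "1" characters. The sketch below assumes uniform frame sampling over the video duration; the function name and timing convention are assumptions for illustration, not taken from the paper.

```python
def decode_moments(bits, video_duration):
    """Convert a decoded '0'/'1' string into (start, end) moment
    intervals in seconds, assuming uniformly sampled frames."""
    frame_len = video_duration / len(bits)
    moments, start = [], None
    for i, b in enumerate(bits):
        if b == "1" and start is None:
            start = i                      # run of foreground frames begins
        elif b != "1" and start is not None:
            moments.append((start * frame_len, i * frame_len))
            start = None                   # run ends
    if start is not None:                  # run extends to the last frame
        moments.append((start * frame_len, len(bits) * frame_len))
    return moments
```

Beam-search logits over the same positions would then supply per-frame saliency scores alongside these intervals.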
I Putu Andika Bagas Jiwanta
School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Bandung, Indonesia
Ayu Purwarianti
Associate Professor, Informatics, Institut Teknologi Bandung, Indonesia
Computational Linguistics · Machine Learning