Multimodal Abstractive Summarization of Instructional Videos with Vision-Language Models

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the misalignment between visual features and linguistic semantics in traditional multimodal video summarization, in particular the difficulty of matching CNN-based classification features with natural language expressions. It proposes ClipSum, a framework that, for the first time, incorporates frozen CLIP vision-language representations into video summarization. Without fine-tuning CLIP, ClipSum achieves semantic alignment through explicit temporal modeling, a dimension-adaptive fusion mechanism, and a Transformer-based text decoder. On the YouCook2 dataset, ClipSum attains a ROUGE-1 score of 33.0% with one-quarter the feature dimensionality of the ResNet-152 baseline (512 vs. 2048), outperforming both ResNet-152 (30.5%) and fine-tuned CLIP (32.3%). The results suggest that semantic alignment from a pretrained vision-language model matters more than feature dimensionality or task-specific fine-tuning.
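The data flow described above (frozen per-frame features, a temporal signal, then a mapping to the decoder's width) can be illustrated with a toy sketch. This is not ClipSum's implementation: the summary does not specify the temporal module or the fusion mechanism, so standard sinusoidal positional encoding and a single linear projection are swapped in as stand-ins. The random vectors stand in for real CLIP embeddings, and `DECODER_DIM`, `add_temporal_order`, and `project` are hypothetical names chosen for illustration.

```python
import math
import random

CLIP_DIM = 512    # CLIP feature width quoted in the source (vs. 2048 for ResNet-152)
DECODER_DIM = 8   # hypothetical decoder width, kept tiny for the demo


def sinusoidal_position(t, dim):
    """Standard Transformer sinusoidal encoding for time step t."""
    return [
        math.sin(t / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(t / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def add_temporal_order(frames):
    """Add a position vector to each frame feature; the features themselves stay fixed."""
    dim = len(frames[0])
    return [
        [x + p for x, p in zip(frame, sinusoidal_position(t, dim))]
        for t, frame in enumerate(frames)
    ]


def project(frames, weight):
    """Linear map from feature width to decoder width (assumed stand-in for fusion)."""
    return [
        [sum(x * w for x, w in zip(frame, row)) for row in weight]
        for frame in frames
    ]


rng = random.Random(0)
# Stand-in for frozen CLIP image embeddings of 4 sampled frames (assumption: random data).
frames = [[rng.gauss(0, 1) for _ in range(CLIP_DIM)] for _ in range(4)]
# In a real model this projection would be learned; here it is random.
weight = [[rng.gauss(0, 0.02) for _ in range(CLIP_DIM)] for _ in range(DECODER_DIM)]

visual_tokens = project(add_temporal_order(frames), weight)
print(len(visual_tokens), len(visual_tokens[0]))  # 4 8
```

In this sketch only the projection would carry trainable parameters, which is one way to read "without fine-tuning CLIP": the encoder output is treated as a fixed input and everything learned sits downstream of it.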

📝 Abstract

Multimodal video summarization requires visual features that align semantically with language generation. Traditional approaches rely on CNN features trained for object classification, which represent visual concepts as discrete categories not aligned with natural language. We propose ClipSum, a framework that leverages frozen CLIP vision-language features with explicit temporal modeling and dimension-adaptive fusion for instructional video summarization. CLIP's contrastive pre-training on 400M image-text pairs yields visual features semantically aligned with the linguistic concepts that text decoders generate, bridging the vision-language gap at the representation level. On YouCook2, ClipSum achieves 33.0% ROUGE-1 versus 30.5% for ResNet-152 with 4x lower dimensionality (512 vs. 2048), demonstrating that semantic alignment matters more than feature capacity. Frozen CLIP (33.0%) surpasses fine-tuned CLIP (32.3%), showing that preserving pre-trained alignment is more valuable than task-specific adaptation. Repository: https://github.com/aqeeelmirza/clipsum
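Since the reported numbers are ROUGE-1 scores, a minimal sketch of the metric may help in reading them. ROUGE-1 measures unigram overlap between a generated summary and a reference. The abstract does not say which variant (precision, recall, or F1) or which preprocessing (stemming, tokenization) was used, so the lowercase whitespace tokenization below and the `rouge1` helper are assumptions for illustration, not the paper's evaluation code.

```python
from collections import Counter


def rouge1(candidate, reference):
    """Unigram-overlap precision, recall, and F1 with clipped counts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical cooking-style sentences, not taken from YouCook2.
scores = rouge1(
    "add the onions to the pan",
    "add chopped onions to the hot pan",
)
print(round(scores["f1"], 3))  # 0.769 (5 shared unigrams: precision 5/6, recall 5/7)
```

Published results are usually computed with an established ROUGE package rather than a hand-rolled function, so scores from this sketch should not be compared directly with the 33.0% figure.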

Problem

Research questions and friction points this paper is trying to address.

multimodal summarization

vision-language alignment

instructional videos

semantic gap

visual features

Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language models

semantic alignment

frozen features