MMTM: Tri-Modal Topic Modeling for Long-Form Video via Similarity-Gated Fusion

📅 2026-05-28
🤖 AI Summary
This study addresses noisy and temporally incoherent topic modeling in long-form video, problems that stem from the difficulty of fusing multimodal information. The authors propose MMTM, a modular tri-modal pipeline that combines speech recognition transcripts, audio embeddings, and visual embeddings through a deterministic similarity-gated fusion step, followed by BERTopic clustering. Evaluated cross-lingually on German (Tagesschau) and English (NBC) broadcast news, joint tri-modal modeling lowers noise from 0.27 to 0.06 and transition rate from 0.70 to 0.21, raises normalized entropy from 0.84 to 0.92, and improves Calinski-Harabasz cluster validity by 5-12x across embedding spaces. Lexical coherence (NPMI) improves only on the German corpus (0.77 to 0.86) and does not transfer to the shorter NBC broadcasts. The authors release the pipeline code and a human-validated 54-hour multimodal video topic corpus.
📝 Abstract
We introduce MMTM, a modular pipeline for topic discovery in long-form video that integrates speech recognition, audio and visual embeddings, and BERTopic clustering through a deterministic similarity-gated fusion. Evaluated cross-lingually on German (Tagesschau) and English (NBC) broadcast news, joint tri-modal modeling substantially improves topic quality: noise drops from 0.27 to 0.06, transition rate from 0.70 to 0.21, and normalized entropy rises from 0.84 to 0.92, indicating more coherent and temporally stable topics. Cluster validity (Calinski-Harabasz) improves by 5-12x across embedding spaces. Lexical coherence (NPMI) rises from 0.77 to 0.86 on German but is corpus-dependent and does not transfer to the shorter NBC broadcasts. We release the pipeline code and a human-validated 54-hour multimodal video topic corpus with dual-annotator visual evaluation and LLM-assisted labeling.
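To make the three sequence-level metrics in the abstract concrete, here is a minimal sketch of how noise, transition rate, and normalized entropy could be computed from a time-ordered list of per-segment topic labels. The exact definitions used in the paper are not given here, so the ones below are assumptions: noise is taken as the share of segments with BERTopic's outlier label `-1`, transition rate as the share of adjacent segment pairs whose labels differ, and normalized entropy as Shannon entropy over the non-outlier topics divided by the log of the number of topics. The function names and the sample `labels` list are hypothetical.

```python
import math
from collections import Counter

OUTLIER = -1  # BERTopic assigns -1 to documents it treats as outliers


def noise_rate(labels):
    """Share of segments labeled as outliers (assumed definition of 'noise')."""
    if not labels:
        return 0.0
    return sum(1 for t in labels if t == OUTLIER) / len(labels)


def transition_rate(labels):
    """Share of adjacent segment pairs whose topic label changes (assumed definition)."""
    if len(labels) < 2:
        return 0.0
    changes = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return changes / (len(labels) - 1)


def normalized_entropy(labels):
    """Shannon entropy of the non-outlier topic distribution, scaled to [0, 1]."""
    counts = Counter(t for t in labels if t != OUTLIER)
    if len(counts) < 2:
        return 0.0  # a single topic (or none) carries no entropy
    total = sum(counts.values())
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(len(counts))


# Hypothetical topic labels for ten consecutive video segments
labels = [0, 0, 0, 1, 1, -1, 2, 2, 2, 2]
print(noise_rate(labels))          # 1 outlier out of 10 segments
print(transition_rate(labels))     # 3 label changes over 9 adjacent pairs
print(normalized_entropy(labels))
```

Under these assumed definitions, a lower noise value means fewer unassigned segments, a lower transition rate means longer uninterrupted topic runs, and a normalized entropy near 1 means segments are spread evenly across the discovered topics.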
Problem

Research questions and friction points this paper is trying to address.

topic modeling
long-form video
tri-modal fusion
multimodal analysis
temporal coherence
Innovation

Methods, ideas, or system contributions that make the work stand out.

tri-modal fusion
similarity-gated fusion
topic modeling
long-form video
BERTopic