CEMTM: Contextual Embedding-based Multimodal Topic Modeling

📅 2025-09-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses topic modeling for multimodal documents containing heterogeneous textual content (short and long text) and multiple images per document. Methodologically, it proposes an interpretable cross-modal topic model that: (1) leverages a fine-tuned vision-language model to extract context-aware joint image-text embeddings; (2) introduces a distributional attention mechanism that dynamically weights token- and patch-level contributions while avoiding redundant image encoding; and (3) incorporates a topic-distribution-driven cross-modal reconstruction objective to explicitly model word-topic and document-topic distributions. The key contribution is the first efficient and interpretable joint topic inference framework for multi-image documents. Evaluated on six benchmarks, the model significantly outperforms state-of-the-art methods, achieving an average LLM-based evaluation score of 2.61. It demonstrates superior performance in few-shot retrieval and semantic modeling of scientific literature.

📝 Abstract
We introduce CEMTM, a context-enhanced multimodal topic model designed to infer coherent and interpretable topic structures from both short and long documents containing text and images. CEMTM builds on fine-tuned large vision language models (LVLMs) to obtain contextualized embeddings, and employs a distributional attention mechanism to weight token-level contributions to topic inference. A reconstruction objective aligns topic-based representations with the document embedding, encouraging semantic consistency across modalities. Unlike existing approaches, CEMTM can process multiple images per document without repeated encoding and maintains interpretability through explicit word-topic and document-topic distributions. Extensive experiments on six multimodal benchmarks show that CEMTM consistently outperforms unimodal and multimodal baselines, achieving a remarkable average LLM score of 2.61. Further analysis shows its effectiveness in downstream few-shot retrieval and its ability to capture visually grounded semantics in complex domains such as scientific articles.
Problem

Research questions and friction points this paper is trying to address.

Inferring coherent topic structures from multimodal documents
Processing multiple images per document without repeated encoding
Maintaining interpretability through explicit topic distributions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contextualized embeddings from fine-tuned LVLMs
Distributional attention mechanism weighting token- and patch-level contributions
Reconstruction objective aligning multimodal representations
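The three innovations above compose into one pipeline: contextual embeddings are pooled by a learned attention, mapped to an explicit document-topic distribution, and regularized by reconstructing the document embedding from topic vectors. The sketch below illustrates that flow with random numpy arrays; all shapes, weights, and names (`w_att`, `W_topic`, `topic_emb`) are hypothetical stand-ins, not CEMTM's actual architecture or training code.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical document: T text tokens and P image patches, each already
# encoded once into d-dim contextual embeddings by a fine-tuned LVLM
# (so multiple images need no repeated encoding).
T, P, d, K = 12, 8, 16, 5             # K = number of topics
tokens = rng.normal(size=(T + P, d))  # joint token/patch embeddings

# Attention over tokens and patches: score each unit, pool into one vector.
w_att = rng.normal(size=d)
alpha = softmax(tokens @ w_att)       # weights over all T+P units, sum to 1
doc_emb = alpha @ tokens              # attention-weighted document embedding (d,)

# Explicit document-topic distribution theta (interpretable by construction).
W_topic = rng.normal(size=(d, K))
theta = softmax(doc_emb @ W_topic)    # (K,), non-negative, sums to 1

# Reconstruction objective: the theta-weighted mix of topic embeddings
# should align with the document embedding across modalities.
topic_emb = rng.normal(size=(K, d))
recon = theta @ topic_emb             # topic-based reconstruction (d,)
recon_loss = float(np.sum((recon - doc_emb) ** 2))

print(theta.round(3), recon_loss)
```

In a real model `recon_loss` would be minimized jointly with the topic-modeling loss, pushing the topic space to explain the multimodal document embedding rather than text alone.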
🔎 Similar Papers
- 2024-04-02 · North American Chapter of the Association for Computational Linguistics · Citations: 2
- 2024-06-13 · arXiv.org · Citations: 0