🤖 AI Summary
This paper addresses topic modeling for multimodal documents containing heterogeneous textual content (short and long text) and multiple images per document. Methodologically, it proposes an interpretable cross-modal topic model that: (1) leverages a fine-tuned vision-language model to extract context-aware joint image-text embeddings; (2) introduces a distributional attention mechanism that dynamically weights token- and patch-level contributions while avoiding redundant image encoding; and (3) incorporates a topic-distribution-driven cross-modal reconstruction objective to explicitly model word-topic and document-topic distributions. The key contribution is the first efficient and interpretable joint topic inference framework for multi-image documents. Evaluated on six benchmarks, the model significantly outperforms state-of-the-art methods, achieving an average LLM-based evaluation score of 2.61. It demonstrates superior performance in few-shot retrieval and semantic modeling of scientific literature.
📝 Abstract
We introduce CEMTM, a context-enhanced multimodal topic model designed to infer coherent and interpretable topic structures from both short and long documents containing text and images. CEMTM builds on fine-tuned large vision-language models (LVLMs) to obtain contextualized embeddings, and employs a distributional attention mechanism to weight token-level contributions to topic inference. A reconstruction objective aligns topic-based representations with the document embedding, encouraging semantic consistency across modalities. Unlike existing approaches, CEMTM can process multiple images per document without repeated encoding and maintains interpretability through explicit word-topic and document-topic distributions. Extensive experiments on six multimodal benchmarks show that CEMTM consistently outperforms unimodal and multimodal baselines, achieving an average LLM-based evaluation score of 2.61. Further analysis shows its effectiveness in downstream few-shot retrieval and its ability to capture visually grounded semantics in complex domains such as scientific articles.
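The abstract's pipeline, attention-weighted pooling of token/patch embeddings, a document-topic distribution, and a topic-based reconstruction aligned with the document embedding, can be sketched in miniature. This is an illustrative toy, not the paper's implementation: the function names, embedding sizes, and the use of cosine similarity for the alignment objective are assumptions for demonstration.

```python
import math

def softmax(xs):
    # numerically stable softmax
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def infer_topics(token_embs, attn_scores, topic_embs):
    """Attention-pool token/patch embeddings into a document embedding,
    derive a document-topic distribution, and reconstruct the document
    embedding from topic embeddings (alignment measured by cosine).
    A toy stand-in for CEMTM's inference step, not the actual model."""
    weights = softmax(attn_scores)  # attention over tokens/patches
    dim = len(token_embs[0])
    doc_emb = [sum(w * t[d] for w, t in zip(weights, token_embs))
               for d in range(dim)]
    # document-topic distribution via similarity to topic embeddings
    theta = softmax([dot(doc_emb, te) for te in topic_embs])
    # topic-based reconstruction of the document embedding
    recon = [sum(p * te[d] for p, te in zip(theta, topic_embs))
             for d in range(dim)]
    align = cosine(doc_emb, recon)  # the reconstruction loss would maximize this
    return theta, align

# toy example: 3 token embeddings, 2 topics, 2-dim embedding space
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
scores = [2.0, 1.0, 0.0]          # hypothetical attention logits
topics = [[1.0, 0.0], [0.0, 1.0]]
theta, align = infer_topics(tokens, scores, topics)
```

The explicit `theta` (document-topic distribution) is what keeps such a model interpretable: each document maps to a probability vector over topics rather than an opaque embedding alone.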