🤖 AI Summary
Scientific knowledge is disseminated across multiple formats, such as research papers, slides, and videos, but these materials lack structured interconnections, which hinders fine-grained alignment among concepts, figures, and explanatory content. To address this gap, this work introduces the Multimodal Conference Dataset (MCD), the first unified collection integrating research papers, presentation videos, explanatory videos, and slides from the same works, accompanied by a systematic evaluation framework for fine-grained cross-format correspondence. The study evaluates embedding models and vision-language models on cross-modal alignment tasks involving text, images, and mathematical formulas. It finds that vision-language models are robust overall but underperform on fine-grained alignment, that embedding models capture text-visual correspondences well, and that equations and symbolic content form distinct clusters in the embedding space.
📝 Abstract
The communication of scientific knowledge has become increasingly multimodal, spanning text, visuals, and speech through materials such as research papers, slides, and recorded presentations. These representations collectively convey a study's reasoning, results, and insights, offering complementary perspectives that enrich understanding. Despite their shared purpose, however, such materials are rarely connected in a structured way. The absence of explicit links across formats makes it difficult to trace how concepts, visuals, and explanations correspond, which limits unified exploration and analysis of research content. To address this gap, we introduce the Multimodal Conference Dataset (MCD), the first benchmark that integrates research papers, presentation videos, explanatory videos, and slides from the same works. We evaluate a range of embedding-based and vision-language models on their ability to discover fine-grained cross-format correspondences, establishing the first systematic benchmark for this task. Our results show that vision-language models are robust but struggle with fine-grained alignment. Embedding-based models capture text-visual correspondences well, but equations and symbolic content form distinct clusters in the embedding space. These findings highlight both the strengths and limitations of current approaches and point to key directions for future research in multimodal scientific understanding. To ensure reproducibility, we release the resources for MCD at https://github.com/meghamariamkm2002/MCD
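
To make the idea of embedding-based cross-format correspondence concrete, the following is a minimal sketch of nearest-neighbour matching by cosine similarity. It is not the authors' evaluation pipeline: the segment identifiers, the three-dimensional vectors, and the helper names `cosine` and `best_matches` are all hypothetical, and in practice the vectors would come from whichever embedding model is being evaluated.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_matches(queries, candidates):
    """Map each query id to the (candidate id, score) with the highest cosine."""
    matches = {}
    for qid, qvec in queries.items():
        scored = [(cosine(qvec, cvec), cid) for cid, cvec in candidates.items()]
        score, cid = max(scored)
        matches[qid] = (cid, score)
    return matches


# Made-up toy embeddings (assumption): real ones would be model outputs.
paper_segments = {
    "paper:method_paragraph": [0.9, 0.1, 0.0],
    "paper:results_figure": [0.1, 0.9, 0.1],
    "paper:loss_equation": [0.0, 0.1, 0.9],
}
slide_segments = {
    "slide:3": [0.8, 0.2, 0.1],
    "slide:7": [0.2, 0.8, 0.0],
}

matches = best_matches(slide_segments, paper_segments)
for slide_id, (paper_id, score) in matches.items():
    print(f"{slide_id} -> {paper_id} (cosine = {score:.3f})")
```

In this toy setup each slide is linked to the paper segment whose vector points in the most similar direction. A segment type whose embeddings sit far from all others, as the abstract reports for equations and symbolic content, would rarely be the top match under such a scheme.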