BOOM: Beyond Only One Modality KIT's Multimodal Multilingual Lecture Companion

📅 2025-12-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

The globalization of education and the rise of online learning make multimodal lecture localization a pressing need: translations should cover speech, slides, and text so that the auditory, visual, and textual channels are all preserved. The paper presents BOOM, a multimodal multilingual lecture companion that jointly translates lecture audio and slides into three synchronized outputs: translated text, localized slides with preserved visual elements, and synthesized speech. The system combines automatic speech recognition, image-based slide translation, machine translation, and text-to-speech synthesis, and uses slide content as context when producing transcripts. Experiments show that slide-aware transcripts also benefit downstream tasks such as summarization and question answering. The slide translation code is publicly released and integrated into KIT's Lecture Translator; all released code and models are MIT-licensed.

📝 Abstract

The globalization of education and rapid growth of online learning have made localizing educational content a critical challenge. Lecture materials are inherently multimodal, combining spoken audio with visual slides, which requires systems capable of processing multiple input modalities. To provide an accessible and complete learning experience, translations must preserve all modalities: text for reading, slides for visual understanding, and speech for auditory learning. We present BOOM, a multimodal multilingual lecture companion that jointly translates lecture audio and slides to produce synchronized outputs across three modalities: translated text, localized slides with preserved visual elements, and synthesized speech. This end-to-end approach enables students to access lectures in their native language while aiming to preserve the original content in its entirety. Our experiments demonstrate that slide-aware transcripts also yield cascading benefits for downstream tasks such as summarization and question answering. We release our Slide Translation code at https://github.com/saikoneru/image-translator and integrate it in Lecture Translator at https://gitlab.kit.edu/kit/isl-ai4lt/lt-middleware/ltpipeline. All released code and models are licensed under the MIT License.
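The data flow described in the abstract can be pictured with a minimal sketch. It is illustrative only and is not the BOOM implementation. Every name here (LectureSegment, LocalizedSegment, localize_segment, and the toy_* stand-ins) is hypothetical, and passing slide text to the recognizer and translator as a plain context string is an assumption about what "slide-aware" could mean, not a detail taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LectureSegment:
    audio: bytes            # audio for one stretch of speech
    slide_text: List[str]   # text lines already extracted from the slide on screen


@dataclass
class LocalizedSegment:
    translated_text: str         # output modality 1: translated transcript
    translated_slide: List[str]  # output modality 2: slide lines to re-render
    speech: bytes                # output modality 3: synthesized audio


def localize_segment(
    segment: LectureSegment,
    transcribe: Callable[[bytes, str], str],
    translate: Callable[[str, str], str],
    synthesize: Callable[[str], bytes],
) -> LocalizedSegment:
    """Run one segment through a hypothetical ASR -> MT -> TTS cascade."""
    # Assumption: slide text is joined into a context string and handed to
    # both the recognizer and the translator.
    context = " ".join(segment.slide_text)
    transcript = transcribe(segment.audio, context)
    translated_text = translate(transcript, context)
    # Each slide line is translated separately so it can be placed back
    # where the original line appeared.
    translated_slide = [translate(line, context) for line in segment.slide_text]
    speech = synthesize(translated_text)
    return LocalizedSegment(translated_text, translated_slide, speech)


# Toy stand-ins so the sketch runs without any models (all hypothetical).
def toy_transcribe(audio: bytes, context: str) -> str:
    return audio.decode("utf-8")


def toy_translate(text: str, context: str) -> str:
    return f"[de] {text}"


def toy_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


if __name__ == "__main__":
    seg = LectureSegment(audio=b"hello", slide_text=["Intro", "Goals"])
    print(localize_segment(seg, toy_transcribe, toy_translate, toy_synthesize))
```

Passing the model components in as callables keeps the sketch independent of any particular ASR, MT, or TTS system; real components would replace the toy functions.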

Problem

Research questions and friction points this paper is trying to address.

Translates lecture audio and slides into multiple languages

Preserves multimodal content including text, slides, and speech

Enhances accessibility for non-native speakers in education

Innovation

Methods, ideas, or system contributions that make the work stand out.

Jointly translates lecture audio and slides

Produces synchronized outputs across three modalities

End-to-end approach aims to preserve the original content in its entirety
