TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages

📅 2024-02-25

🏛️ arXiv.org

📈 Citations: 7

✨ Influential: 0

🤖 AI Summary

To address the scarcity of paired multimodal data and the high computational cost of joint multimodal processing, this paper proposes a "modality-as-language" view that casts pairwise translation among speech, image, and text as a machine translation problem. Methodologically, (1) speech and image data are converted into discrete tokens (via techniques such as VQ-VAE), which gives all modalities a unified sequence interface and reduces computational cost; and (2) a single multi-modal encoder-decoder performs the core translation, while modality-specific processing is confined to the tokenization and detokenization stages. Evaluated on all six cross-modal translation directions, the model consistently outperforms its single-task counterparts, indicating that unifying the tasks benefits performance as well as practicality.

📝 Abstract

Jointly processing multi-modal information is becoming an essential capability. However, the limited amount of paired multi-modal data and the large computational requirements of multi-modal learning hinder progress. We propose a novel Tri-Modal Translation (TMT) model that translates between arbitrary modalities spanning speech, image, and text. We introduce a novel viewpoint, where we interpret different modalities as different languages, and treat multi-modal translation as a well-established machine translation problem. To this end, we tokenize speech and image data into discrete tokens, which provide a unified interface across modalities and significantly decrease the computational cost. In the proposed TMT, a multi-modal encoder-decoder conducts the core translation, whereas modality-specific processing is conducted only within the tokenization and detokenization stages. We evaluate the proposed TMT on all six modality translation tasks. TMT consistently outperforms single-model counterparts, demonstrating that unifying tasks is beneficial not only for practicality but also for performance.
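The tokenization and detokenization idea in the abstract can be illustrated with a toy nearest-codebook quantizer. This is a minimal sketch under stated assumptions, not the tokenizers used in the paper: the codebook, the 2-D feature vectors, and the function names `tokenize` and `detokenize` are all hypothetical, and a real system would learn the codebook from data.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tokenize(features: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each continuous feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


def detokenize(tokens: List[int], codebook: List[Vector]) -> List[Vector]:
    """Map discrete tokens back to their (lossy) continuous codebook vectors."""
    return [codebook[t] for t in tokens]


# Hypothetical 3-entry codebook and 3 feature frames (assumptions for illustration).
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
features = [(0.1, -0.1), (0.9, 1.2), (0.2, 0.8)]

tokens = tokenize(features, codebook)
reconstructed = detokenize(tokens, codebook)
print(tokens)
```

After this step the translation model only ever sees short integer sequences, which is where the unified interface and the reduced cost come from; everything modality-specific lives in `tokenize` and `detokenize`.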

Problem

Research questions and friction points this paper is trying to address.

Translating between speech, image, and text modalities

Overcoming limited paired multi-modal data challenges

Reducing computational costs in multi-modal learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Treats modalities as different languages

Tokenizes speech and image into discrete tokens

Uses multi-modal encoder-decoder for translation
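The three points above can be sketched as a shared token interface: once every modality is a sequence of integers, each modality can occupy its own range of one vocabulary, much like languages in a multilingual translation model. The sketch below is an assumption-laden illustration, not the paper's implementation: the vocabulary sizes, the special tokens, the target-modality tag, and the helpers `to_shared` and `make_source` are all hypothetical.

```python
from itertools import permutations
from typing import Dict, List

# Hypothetical per-modality vocabulary sizes (not taken from the paper).
VOCAB_SIZES: Dict[str, int] = {"text": 1000, "speech": 200, "image": 8192}

# Hypothetical special tokens; the "<to_*>" tags tell the decoder which
# modality to produce, mirroring target-language tags in multilingual MT.
SPECIAL: List[str] = ["<pad>", "<eos>", "<to_text>", "<to_speech>", "<to_image>"]


def build_offsets(sizes: Dict[str, int]) -> Dict[str, int]:
    """Give each modality a disjoint ID range placed after the special tokens."""
    offsets: Dict[str, int] = {}
    start = len(SPECIAL)
    for name, size in sizes.items():
        offsets[name] = start
        start += size
    return offsets


OFFSETS = build_offsets(VOCAB_SIZES)


def to_shared(modality: str, tokens: List[int]) -> List[int]:
    """Shift modality-local token IDs into the shared vocabulary."""
    size = VOCAB_SIZES[modality]
    for t in tokens:
        if not 0 <= t < size:
            raise ValueError(f"token {t} out of range for {modality}")
    return [OFFSETS[modality] + t for t in tokens]


def make_source(src: str, tokens: List[int], tgt: str) -> List[int]:
    """Build an encoder input: target tag, shared-vocabulary tokens, end marker."""
    tag = SPECIAL.index(f"<to_{tgt}>")
    return [tag] + to_shared(src, tokens) + [SPECIAL.index("<eos>")]


# The six translation directions are the ordered pairs of distinct modalities.
directions = list(permutations(VOCAB_SIZES, 2))

example = make_source("speech", [3, 199], "image")
print(len(directions), example)
```

With this layout a single encoder-decoder can be trained on all six directions at once, because the only thing that distinguishes a speech-to-image example from a text-to-speech one is the ID ranges and the tag at the front of the sequence.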
