🤖 AI Summary
To address the scarcity of paired multimodal data and the high computational cost of joint multimodal processing, this paper proposes a "modality-as-language" view that treats translation between any pair of the speech, image, and text modalities as a machine translation task. Methodologically, (1) speech and image data are converted into discrete tokens (e.g., with techniques such as VQ-VAE), giving all modalities a unified sequence interface; and (2) a shared-parameter multimodal encoder–decoder performs the core translation, so that modality-specific processing is confined to the tokenization and detokenization stages. Evaluated on all six cross-modal translation directions, the unified model consistently outperforms its single-model counterparts. The results indicate that unifying the tasks improves performance as well as practicality, while discrete tokens reduce the computational cost.
📝 Abstract
The capability to jointly process multi-modal information is becoming essential. However, the limited amount of paired multi-modal data and the large computational requirements of multi-modal learning hinder progress. We propose a novel Tri-Modal Translation (TMT) model that translates between arbitrary modalities spanning speech, image, and text. We introduce a novel viewpoint in which we interpret different modalities as different languages and treat multi-modal translation as a well-established machine translation problem. To this end, we tokenize speech and image data into discrete tokens, which provide a unified interface across modalities and significantly decrease the computational cost. In the proposed TMT, a multi-modal encoder-decoder conducts the core translation, whereas modality-specific processing is conducted only within the tokenization and detokenization stages. We evaluate the proposed TMT on all six modality translation tasks. TMT consistently outperforms its single-model counterparts, demonstrating that unifying tasks is beneficial not only for practicality but also for performance.
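The unified token interface described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each modality's discrete tokens are shifted into a disjoint range of one shared vocabulary and that a leading tag tells the encoder-decoder which target "language" to produce. The vocabulary sizes, the tag scheme, and every function name below are hypothetical and are not taken from the paper.

```python
# Hypothetical per-modality vocabulary sizes (not the paper's actual values).
VOCAB_SIZES = {"text": 1000, "speech": 200, "image": 512}
MODALITIES = list(VOCAB_SIZES)

# Assumed tag scheme: one target-modality tag per modality, placed first
# in the shared vocabulary.
TAGS = {m: i for i, m in enumerate(MODALITIES)}

# Shift each modality's local token IDs into its own disjoint range.
OFFSETS = {}
_next_id = len(TAGS)
for _m in MODALITIES:
    OFFSETS[_m] = _next_id
    _next_id += VOCAB_SIZES[_m]
SHARED_VOCAB_SIZE = _next_id


def to_shared(modality, local_ids):
    """Map modality-local token IDs into the shared vocabulary."""
    size = VOCAB_SIZES[modality]
    for t in local_ids:
        if not 0 <= t < size:
            raise ValueError(f"token {t} out of range for {modality}")
    return [OFFSETS[modality] + t for t in local_ids]


def from_shared(modality, shared_ids):
    """Map shared-vocabulary IDs back to modality-local token IDs."""
    lo = OFFSETS[modality]
    hi = lo + VOCAB_SIZES[modality]
    for t in shared_ids:
        if not lo <= t < hi:
            raise ValueError(f"token {t} does not belong to {modality}")
    return [t - lo for t in shared_ids]


def make_source(src_modality, local_ids, tgt_modality):
    """Build an encoder input: target-modality tag, then source tokens."""
    return [TAGS[tgt_modality]] + to_shared(src_modality, local_ids)


def translation_directions():
    """Enumerate every ordered pair of distinct modalities."""
    return [(s, t) for s in MODALITIES for t in MODALITIES if s != t]
```

Under these assumptions, the three modalities yield six ordered source-target pairs, matching the six translation tasks in the evaluation, and a single sequence-to-sequence model can serve all of them because only the tokenizers and detokenizers need to know about the raw signals.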