🤖 AI Summary
Text-image machine translation (TIMT) requires jointly modeling OCR, vision-language reasoning, and cross-lingual translation; conventional cascaded approaches suffer from error propagation and low efficiency. To address this, we propose MT³, the first framework to apply multi-task reinforcement learning (RL) to a multimodal large language model (MLLM) for end-to-end TIMT. It integrates a fine-grained hybrid reward mechanism with rule-guided, non-binary feedback, curriculum learning, and RL-based initialization. We also introduce XHSPost, the first TIMT benchmark tailored to social media scenarios. Our MT³-7B-Zero model achieves state-of-the-art performance on MIT-10M, significantly outperforming much larger models such as Qwen2.5-VL-72B, and it demonstrates strong cross-lingual generalization and robustness in real-world social media settings, empirically validating the feasibility and advantages of end-to-end TIMT.
📝 Abstract
Text Image Machine Translation (TIMT), the task of translating textual content embedded in images, is critical for applications in accessibility, cross-lingual information access, and real-world document understanding. However, TIMT remains a complex challenge: it demands accurate optical character recognition (OCR), robust visual-text reasoning, and high-quality translation, and is often handled with cascaded multi-stage pipelines. Recent advances in large-scale Reinforcement Learning (RL) have improved reasoning in Large Language Models (LLMs) and Multimodal LLMs (MLLMs), but their application to end-to-end TIMT remains underexplored. To bridge this gap, we introduce MT$^{3}$, the first framework to apply Multi-Task RL to MLLMs for end-to-end TIMT. MT$^{3}$ adopts a multi-task optimization paradigm targeting three key sub-skills: text recognition, context-aware reasoning, and translation. It is trained with a novel multi-mixed reward mechanism that adapts rule-based RL strategies to TIMT's intricacies, offering fine-grained, non-binary feedback across tasks. Furthermore, to facilitate the evaluation of TIMT in authentic cross-cultural and real-world social media contexts, we introduce XHSPost, the first social media TIMT benchmark. Our MT$^{3}$-7B-Zero achieves state-of-the-art results on the latest in-domain MIT-10M benchmark, outperforming strong baselines such as Qwen2.5-VL-72B and InternVL2.5-78B by notable margins across multiple metrics. The model also generalizes well to out-of-distribution language pairs and datasets. In-depth analyses reveal how multi-task synergy, reinforcement learning initialization, curriculum design, and reward formulation contribute to advancing MLLM-driven TIMT.
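To illustrate the idea of fine-grained, non-binary rewards across sub-tasks, here is a minimal sketch (not the paper's actual reward): it scores a rollout by combining a continuous text-recognition reward and a continuous translation reward, each computed as a string-similarity ratio rather than a 0/1 match. The weights `w_rec` and `w_trans` and the use of `difflib.SequenceMatcher` as the similarity function are illustrative assumptions.

```python
from difflib import SequenceMatcher


def similarity(pred: str, ref: str) -> float:
    """Continuous similarity in [0, 1], a stand-in for task-specific
    metrics such as character error rate or BLEU."""
    if not pred and not ref:
        return 1.0
    return SequenceMatcher(None, pred, ref).ratio()


def mixed_reward(pred_text: str, ref_text: str,
                 pred_translation: str, ref_translation: str,
                 w_rec: float = 0.4, w_trans: float = 0.6) -> float:
    """Weighted, non-binary reward over two TIMT sub-tasks:
    text recognition and translation (weights are hypothetical)."""
    r_rec = similarity(pred_text, ref_text)          # OCR sub-reward
    r_trans = similarity(pred_translation, ref_translation)  # MT sub-reward
    return w_rec * r_rec + w_trans * r_trans
```

A partially correct rollout (e.g. perfect recognition but an imperfect translation) thus receives partial credit instead of zero, which is the practical difference from binary rule-based rewards.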