🤖 AI Summary
Current process reward modeling (PRM) for large language model (LLM)-based machine translation (MT) lacks a systematic methodology and dedicated evaluation benchmarks. Method: We propose the first end-to-end reward modeling framework tailored for MT, featuring (i) an approximate Monte Carlo Tree Search (MCTS)-based method for automatic token-level preference pair generation, which reduces annotation cost; (ii) the first MT-specific PRM benchmark; and (iii) a systematic comparison of reward modeling architectures, together with test-time alignment and hypothesis ensembling mechanisms. Contributions/Results: Our MT-PRM-Qwen-2.5-3B achieves state-of-the-art performance on both token-level and sequence-level evaluations. It supports zero-shot test-time alignment without additional training and significantly improves translation consistency and robustness across diverse domains and error types.
📝 Abstract
Process reward models (PRMs) have shown success in complex reasoning tasks for large language models (LLMs). However, their application to machine translation (MT) remains underexplored due to the lack of systematic methodologies and evaluation benchmarks. To address this gap, we introduce **MT-RewardTree**, a comprehensive framework for constructing, evaluating, and deploying process reward models in MT. Unlike conventional preference pair construction, we propose a novel method for automatically generating token-level preference pairs using approximate Monte Carlo Tree Search (MCTS), which mitigates the prohibitive cost of human annotation for fine-grained steps. We then establish the first MT-specific reward model benchmark and provide a systematic comparison of different reward modeling architectures, revealing that token-level supervision effectively captures fine-grained preferences. Experimental results demonstrate that our MT-PRM-Qwen-2.5-3B achieves state-of-the-art performance in both token-level and sequence-level evaluation given the same input prefix. Furthermore, we showcase practical applications where PRMs enable test-time alignment for LLMs without additional alignment training and significantly improve performance in hypothesis ensembling. Our work provides valuable insights into the role of reward models in MT research. Our code and data are released at [https://sabijun.github.io/MT_RewardTreePage](https://sabijun.github.io/MT_RewardTreePage/).
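To make the test-time alignment idea concrete, here is a minimal sketch of how a token-level PRM could rerank candidate translation hypotheses without retraining the base model. All function names and the toy reward are illustrative assumptions, not the actual MT-RewardTree scoring interface; a real deployment would call a trained PRM such as MT-PRM-Qwen-2.5-3B at each step.

```python
# Hedged sketch: rerank translation hypotheses by aggregated per-token
# process reward. `step_reward` stands in for a learned token-level PRM.

def prm_guided_rerank(hypotheses, step_reward):
    """Rank token sequences by their mean per-step process reward.

    hypotheses  : list of token lists (candidate translations)
    step_reward : callable(prefix_tokens, next_token) -> float
    """
    def score(tokens):
        total = 0.0
        for i, tok in enumerate(tokens):
            # Score each token conditioned on its prefix, mirroring
            # token-level (process) supervision.
            total += step_reward(tokens[:i], tok)
        return total / max(len(tokens), 1)  # mean step reward

    return sorted(hypotheses, key=score, reverse=True)


# Toy "PRM" for demonstration only: rewards lowercase tokens. A real
# PRM would output learned token-level preference scores.
toy_prm = lambda prefix, tok: 1.0 if tok.islower() else 0.0

ranked = prm_guided_rerank(
    [["The", "CAT", "SAT"], ["the", "cat", "sat"]],
    toy_prm,
)
```

The same scoring loop can drive hypothesis ensembling (keeping the top-scored hypothesis from several decoders) or stepwise decoding guidance, which is the sense in which a PRM enables alignment purely at inference time.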