🤖 AI Summary
Current process reward modeling (PRM) for large language model (LLM)-based machine translation (MT) lacks a systematic methodology and dedicated evaluation benchmarks. Method: We propose the first end-to-end reward modeling framework tailored for MT, featuring (i) an approximate Monte Carlo Tree Search (MCTS)-based method for automatic token-level preference pair generation, which reduces annotation cost; (ii) the first MT-specific PRM benchmark; and (iii) a systematic comparison of reward modeling architectures, together with test-time alignment and hypothesis ensembling mechanisms. Contributions/Results: Our MT-PRM-Qwen-2.5-3B achieves state-of-the-art performance on both token-level and sequence-level evaluations. It supports zero-shot test-time alignment without additional training and significantly improves translation consistency and robustness across diverse domains and error types.
📝 Abstract
Process reward models (PRMs) have shown success in complex reasoning tasks for large language models (LLMs). However, their application to machine translation (MT) remains underexplored due to the lack of systematic methodologies and evaluation benchmarks. To address this gap, we introduce **MT-RewardTree**, a comprehensive framework for constructing, evaluating, and deploying process reward models in MT. Unlike conventional preference pair construction, we propose a novel method for automatically generating token-level preference pairs using approximate Monte Carlo Tree Search (MCTS), which mitigates the prohibitive cost of human annotation for fine-grained steps. We then establish the first MT-specific reward model benchmark and provide a systematic comparison of different reward modeling architectures, revealing that token-level supervision effectively captures fine-grained preferences. Experimental results demonstrate that our MT-PRM-Qwen-2.5-3B achieves state-of-the-art performance in both token-level and sequence-level evaluation given the same input prefix. Furthermore, we showcase practical applications where PRMs enable test-time alignment for LLMs without additional alignment training and significantly improve performance in hypothesis ensembling. Our work provides valuable insights into the role of reward models in MT research. Our code and data are released at [https://sabijun.github.io/MT_RewardTreePage](https://sabijun.github.io/MT_RewardTreePage/).
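To make the test-time alignment idea concrete, here is a minimal sketch of how a token-level PRM could rerank candidate translation hypotheses without retraining the base model. All function names and the toy reward are illustrative assumptions, not the actual MT-RewardTree scoring interface; a real deployment would call a trained PRM such as MT-PRM-Qwen-2.5-3B at each step.

```python
# Hedged sketch: rerank translation hypotheses by aggregated per-token
# process reward. `step_reward` stands in for a learned token-level PRM.

def prm_guided_rerank(hypotheses, step_reward):
    """Rank token sequences by their mean per-step process reward.

    hypotheses  : list of token lists (candidate translations)
    step_reward : callable(prefix_tokens, next_token) -> float
    """
    def score(tokens):
        total = 0.0
        for i, tok in enumerate(tokens):
            # Score each token conditioned on its prefix, mirroring
            # token-level (process) supervision.
            total += step_reward(tokens[:i], tok)
        return total / max(len(tokens), 1)  # mean step reward

    return sorted(hypotheses, key=score, reverse=True)


# Toy "PRM" for demonstration only: rewards lowercase tokens. A real
# PRM would output learned token-level preference scores.
toy_prm = lambda prefix, tok: 1.0 if tok.islower() else 0.0

ranked = prm_guided_rerank(
    [["The", "CAT", "SAT"], ["the", "cat", "sat"]],
    toy_prm,
)
```

The same scoring loop can drive hypothesis ensembling (keeping the top-scored hypothesis from several decoders) or stepwise decoding guidance, which is the sense in which a PRM enables alignment purely at inference time.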