Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

📅 2026-05-01
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses a limitation of existing code reward models: they rely predominantly on execution-based feedback and focus narrowly on functional correctness, so they cannot support multi-criteria, multilingual assessment of code quality. The authors introduce Themis-CodeRewardBench, a code reward benchmark covering five quality dimensions (e.g., readability, efficiency) and eight programming languages, and use it to profile more than 50 code, math, and general-purpose reward models. They also release Themis-CodePreference, described as the largest open-source code preference dataset to date, with more than 350,000 preference pairs, and use it to train Themis-RM, a suite of multilingual reward models ranging from 600M to 32B parameters. Experiments and ablations show positive scaling trends, strong cross-lingual transfer when training on diverse preferences, and the importance of multi-criteria training for reliable code reward modeling.
📝 Abstract
Reward models (RMs) have become an indispensable fixture of the language model (LM) post-training playbook, enabling policy alignment and test-time scaling. Research on the application of RMs in code generation, however, has been comparatively sparse, with existing work largely focusing on execution feedback. This choice constrains post-training to optimizing functional correctness over self-contained executable code. In this work, we examine the training and evaluation of multilingual, multi-criteria code RMs. To this end, we first compile Themis-CodeRewardBench, a benchmark to evaluate code RMs across five preference dimensions (i.e., criteria) and eight programming languages, on which we profile 50+ code, math, and general-purpose RMs. Observing the limited proficiency of current RMs beyond scoring for functional correctness, we develop Themis-CodePreference, the largest open-source collection of code preferences to date (more than 350k preference pairs), and use it to train Themis-RM, a suite of multilingual code reward models for flexible multi-criteria scoring, ranging in size from 600M to 32B parameters. Our experiments and ablations demonstrate positive scaling trends, strong cross-lingual transfer when training on diverse preferences, and the importance of multi-criteria training for reliable code reward modeling.
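To make the evaluation setup concrete, here is a minimal sketch of how a reward model might be profiled on preference pairs broken down by criterion and programming language. The abstract does not state the benchmark's metric or data schema, so the pairwise-accuracy metric, the `PreferencePair` fields, and the `toy_score` stand-in are all assumptions made for illustration, not the paper's implementation.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple, Tuple


class PreferencePair(NamedTuple):
    """One benchmark item (hypothetical schema, not taken from the paper)."""
    prompt: str
    chosen: str      # response preferred under `criterion`
    rejected: str    # response dispreferred under `criterion`
    criterion: str   # e.g. "functional_correctness"
    language: str    # e.g. "python"


# A reward model is treated as a black box: (prompt, response, criterion) -> score.
ScoreFn = Callable[[str, str, str], float]


def pairwise_accuracy(
    pairs: Iterable[PreferencePair], score: ScoreFn
) -> Dict[Tuple[str, str], float]:
    """Fraction of pairs per (criterion, language) where chosen outscores rejected."""
    hits: Dict[Tuple[str, str], int] = defaultdict(int)
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for p in pairs:
        key = (p.criterion, p.language)
        totals[key] += 1
        chosen_score = score(p.prompt, p.chosen, p.criterion)
        rejected_score = score(p.prompt, p.rejected, p.criterion)
        if chosen_score > rejected_score:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


def toy_score(prompt: str, response: str, criterion: str) -> float:
    """Stand-in scorer for illustration only: it simply prefers shorter responses."""
    return -float(len(response))


if __name__ == "__main__":
    demo = [
        PreferencePair("p", "ab", "abcd", "functional_correctness", "python"),
        PreferencePair("p", "abcdef", "a", "functional_correctness", "python"),
        PreferencePair("p", "x", "xyz", "functional_correctness", "rust"),
    ]
    for key, acc in sorted(pairwise_accuracy(demo, toy_score).items()):
        print(key, acc)
```

Passing the criterion to the scorer reflects the "flexible multi-criteria scoring" framing: the same response can be scored differently depending on which quality dimension is requested. Reporting accuracy per (criterion, language) cell rather than as one aggregate is what would expose weak performance outside functional correctness.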
Problem

Research questions and friction points this paper is trying to address.

reward models
code generation
multilingual
multi-criteria scoring
functional correctness
Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual code reward models
multi-criteria scoring
code preference dataset
cross-lingual transfer
reward model scaling