BLEUBERI: BLEU is a surprisingly effective reward for instruction following

📅 2025-05-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the reliance of large language model (LLM) alignment on reward models, which are costly to train because they require large-scale human-labeled preference data and powerful pretrained LLM backbones. It proposes a lightweight alternative that uses a parameter-free string-matching metric, BLEU, directly as the reinforcement learning reward. The method, BLEUBERI, first identifies challenging instructions and then applies Group Relative Policy Optimization (GRPO); it requires only high-quality reference outputs and no reward model training. The paper first shows that BLEU matches strong reward models in agreement with human preferences on general instruction-following datasets. Across four instruction-following benchmarks and three base models, BLEUBERI-trained models are competitive with models trained via reward model-guided RL. A human evaluation finds their output quality on par with reward model-aligned models, and their outputs are more factually grounded than those of competing methods. Because no reward model has to be trained, string-matching metrics serve as a cheap proxy for reward models during alignment.

📝 Abstract

Reward models are central to aligning LLMs with human preferences, but they are costly to train, requiring large-scale human-labeled preference data and powerful pretrained LLM backbones. Meanwhile, the increasing availability of high-quality synthetic instruction-following datasets raises the question: can simpler, reference-based metrics serve as viable alternatives to reward models during RL-based alignment? In this paper, we show first that BLEU, a basic string-matching metric, surprisingly matches strong reward models in agreement with human preferences on general instruction-following datasets. Based on this insight, we develop BLEUBERI, a method that first identifies challenging instructions and then applies Group Relative Policy Optimization (GRPO) using BLEU directly as the reward function. We demonstrate that BLEUBERI-trained models are competitive with models trained via reward model-guided RL across four challenging instruction-following benchmarks and three different base language models. A human evaluation further supports that the quality of BLEUBERI model outputs is on par with those from reward model-aligned models. Moreover, BLEUBERI models generate outputs that are more factually grounded than competing methods. Overall, we show that given access to high-quality reference outputs (easily obtained via existing instruction-following datasets or synthetic data generation), string matching-based metrics are cheap yet effective proxies for reward models during alignment. We release our code and data at https://github.com/lilakk/BLEUBERI.
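To make "BLEU directly as the reward function" concrete, here is a minimal sketch of a sentence-level BLEU score used as a scalar reward. It is not taken from the released repository: the function name `bleu_reward`, whitespace tokenization, the single-reference setup, and add-one smoothing are all assumptions chosen to keep the example dependency-free.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_reward(candidate: str, reference: str, max_n: int = 4) -> float:
    """Sentence-level BLEU against one reference, returned as a reward in (0, 1].

    Assumptions (not from the paper): whitespace tokenization and add-one
    smoothing so that short or low-overlap outputs do not produce log(0).
    """
    cand = candidate.split()
    ref = reference.split()
    if not cand or not ref:
        return 0.0

    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # Clipped matches: each candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        log_precision_sum += math.log((overlap + 1) / (total + 1))

    # Brevity penalty discourages candidates shorter than the reference.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))

    return brevity_penalty * math.exp(log_precision_sum / max_n)


reference = "the cat sat on a mat"
exact = bleu_reward("the cat sat on a mat", reference)
partial = bleu_reward("the cat sat on the mat", reference)
unrelated = bleu_reward("x y z w q r", reference)
```

In this sketch an exact match scores highest, a near match scores lower, and an unrelated output scores lowest, which is the ordering an RL reward needs. A production setup would more likely use an established BLEU implementation with its own tokenization and smoothing.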

Problem

Research questions and friction points this paper is trying to address.

Can simpler metrics replace costly reward models for LLM alignment?

Does BLEU effectively match human preferences in instruction-following tasks?

Is BLEUBERI competitive with reward model-guided RL in output quality?
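The second question above concerns agreement with human preferences. One common way to measure this, sketched below under stated assumptions, is pairwise agreement: for each pair of responses, check whether the metric prefers the same response as the human annotator. The names `pairwise_agreement` and `unigram_precision` are hypothetical, the scorer is a simple stand-in for BLEU, and the three examples are invented for illustration only, not data from the paper.

```python
from typing import Callable, Iterable, Tuple

# (response_a, response_b, reference, human_choice), human_choice is "a" or "b"
Example = Tuple[str, str, str, str]


def unigram_precision(candidate: str, reference: str) -> float:
    """Stand-in reference-based scorer: share of candidate tokens found in the reference."""
    cand = candidate.split()
    ref = set(reference.split())
    if not cand:
        return 0.0
    return sum(tok in ref for tok in cand) / len(cand)


def pairwise_agreement(
    examples: Iterable[Example],
    score_fn: Callable[[str, str], float],
) -> float:
    """Fraction of pairs where the metric picks the human-preferred response.

    Ties are resolved in favor of response "a" (an arbitrary choice).
    """
    examples = list(examples)
    if not examples:
        return 0.0
    hits = 0
    for resp_a, resp_b, ref, human_choice in examples:
        metric_choice = "a" if score_fn(resp_a, ref) >= score_fn(resp_b, ref) else "b"
        hits += metric_choice == human_choice
    return hits / len(examples)


# Invented toy pairs, for illustration only.
toy_examples = [
    ("Paris is the capital of France", "I like trains",
     "The capital of France is Paris", "a"),
    ("no idea", "Water boils at 100 C",
     "Water boils at 100 C at sea level", "b"),
    ("Two plus two is four", "The answer is four", "four", "a"),
]

agreement = pairwise_agreement(toy_examples, unigram_precision)
```

Any reference-based scorer with the same `(candidate, reference) -> float` signature can be passed as `score_fn`, so the same loop could compare BLEU with other metrics on a shared set of human-labeled pairs.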

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses BLEU as reward for RL alignment

Identifies challenging instructions first

Applies Group Relative Policy Optimization
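The last two steps above can be sketched in a few lines. The summary does not say how challenging instructions are identified, so the selection rule here (keep the prompts whose current outputs score lowest against the reference) is an assumption, as are the names `select_hard_prompts` and `group_relative_advantages`. The advantage computation follows the usual GRPO recipe of standardizing rewards within a group of samples for the same prompt; the full GRPO objective (clipping, KL term) is omitted.

```python
import statistics
from typing import Dict, List


def select_hard_prompts(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k prompt ids with the lowest reward (assumed notion of 'hard')."""
    return sorted(scores, key=scores.get)[:k]


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    are judged relative to their siblings rather than on an absolute scale.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical per-prompt rewards for the current model's outputs.
prompt_scores = {"p1": 0.9, "p2": 0.1, "p3": 0.4}
hard = select_hard_prompts(prompt_scores, k=2)

# Hypothetical rewards for three sampled completions of one prompt.
advantages = group_relative_advantages([0.1, 0.3, 0.5])
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

When every sample in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one intuition for why training on prompts the model already handles well may be less useful.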
