CARMO: Dynamic Criteria Generation for Context-Aware Reward Modelling

📅 2024-10-28
🤖 AI Summary
Reward hacking in large language model (LLM) reward modeling, e.g., over-optimization of superficial features such as redundant bullet points or excessive verbosity, undermines alignment fidelity. Method: We propose a context-aware dynamic criterion generation mechanism that, conditioned on the user query, instantiates highly relevant evaluation criteria (e.g., logical consistency, clarity) to constrain reward signals; we provide theoretical guarantees for its efficacy against reward hacking. Our framework integrates adaptive criterion generation, preference dataset construction, RLHF training, and lightweight model distillation. Results: On zero-shot Reward Bench, our method improves performance by 2.1%; on Mistral-Base (7B), it achieves LC-WR of 22.5% and WR of 21.1%, setting a new state-of-the-art in alignment performance. This work introduces the first dynamic criterion generation paradigm, uniquely balancing alignment quality and computational efficiency.

📝 Abstract
Reward modeling in large language models is susceptible to reward hacking, causing models to latch onto superficial features such as the tendency to generate lists or unnecessarily long responses. In reinforcement learning from human feedback (RLHF), and more generally during post-training, flawed reward signals often lead to outputs that optimize for these spurious correlates instead of genuine quality or correctness. We propose Context-Aware Reward Modeling (CARMO), a novel approach that first generates dynamic, context-relevant criteria to ground the reward model before producing reward scores. Unlike prior methods that rely on static rubrics, CARMO leverages large language models (LLMs) to adaptively create evaluation criteria (such as logical consistency, clarity, and depth) tailored to the user query. Our theoretical analysis shows that such criteria generation can mitigate reward hacking. We further demonstrate that CARMO can be distilled into smaller models, reducing the computational cost of alignment. We establish a new state-of-the-art performance in zero-shot settings for generative models, achieving a 2.1% improvement on Reward Bench. Furthermore, alignment performed on the CARMO-curated preference dataset achieves LC-WR of 22.5% and WR of 21.1% on Mistral-Base (7B).
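The abstract's two-stage idea (generate query-conditioned criteria, then score a response against them) can be sketched as follows. This is a toy illustration, not the paper's implementation: `generate_criteria` and the per-criterion judge here are hypothetical heuristic stand-ins for the LLM calls the paper describes.

```python
def generate_criteria(query: str) -> list[str]:
    """Stand-in for an LLM that instantiates query-relevant criteria.

    A real system would prompt an LLM to propose criteria conditioned on
    the query; here we use a fixed base set plus a keyword heuristic.
    """
    criteria = ["logical consistency", "clarity"]
    if any(w in query.lower() for w in ("prove", "derive", "why", "explain")):
        criteria.append("depth")
    return criteria


def score_response(response: str, criteria: list[str]) -> float:
    """Stand-in judge: average per-criterion scores in [0, 1].

    A real system would ask an LLM judge to rate the response against
    each generated criterion; the reward is grounded in those criteria
    rather than in surface features like length alone.
    """
    def judge(criterion: str) -> float:
        # Toy heuristic: penalize verbosity under "clarity" to mimic
        # resisting the length-based reward hacking the paper targets.
        if criterion == "clarity":
            return 1.0 if len(response.split()) < 100 else 0.5
        return 1.0 if response.strip() else 0.0

    return sum(judge(c) for c in criteria) / len(criteria)


query = "Why does gradient descent converge on convex objectives?"
criteria = generate_criteria(query)
reward = score_response("Because the objective is convex, every step ...", criteria)
```

The design point is the ordering: criteria are produced before any score is assigned, so the reward is conditioned on what matters for this particular query instead of a static rubric.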
Problem

Research questions and friction points this paper is trying to address.

Addresses reward hacking in large language models
Proposes context-aware dynamic criteria generation
Improves zero-shot performance in generative models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic criteria generation
Mitigates reward hacking
Reduces computational cost