KARMA: Karma-Aligned Reward Model Adaptation

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models struggle to capture pragmatic behaviors that depend on context, tone, and social norms. This work trains a context-conditional reward model on large-scale Reddit interaction data and uses it to fine-tune language models via reinforcement learning, evaluating downstream models both with and without direct exposure to the social media data. Experiments show that a reward model relying only on conversational context, although a worse predictor of Reddit karma than the best-performing reward model, yields substantially better downstream pragmatic competence while largely mitigating undesirable side effects. However, factual accuracy declines across all configurations, including when the downstream model has no direct exposure to Reddit data, which suggests that the tension with truthfulness lies in the reward signal itself. These findings point to a trade-off in pragmatic modeling between leveraging social signals and maintaining factual fidelity.

📝 Abstract

Human communication depends on implicit social signals where effectiveness is shaped by tone, context, and conversational norms rather than semantic content alone. We introduce KARMA (Karma-Aligned Reward Model Adaptation), a framework for LLM learning of context-sensitive conversational behavior from large-scale social interaction data. KARMA trains a reward model on Reddit conversations to predict response valuation conditioned on context, and uses this signal to fine-tune language models via reinforcement learning to improve performance on pragmatics-mediated tasks. Critically, we find that the highest performing reward model does not lead to better downstream model alignment: a reward model relying exclusively on conversational context was a worse predictor of Reddit karma but yielded substantially better downstream performance. We evaluate the effects of KARMA applied to a downstream model with and without direct exposure to the social media data. The resulting models show improved pragmatics-mediated behaviors with largely mitigated undesirable side effects. Factuality is consistently diminished by KARMA across all conditions, including when the downstream model has no direct exposure to Reddit data, suggesting that this tension is embedded in the reward signal itself rather than introduced by noisy training data.
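
The two-stage idea in the abstract (score a response conditioned on its context, then push a policy toward higher-scoring responses) can be illustrated with a minimal toy sketch. Everything here is an assumption made for illustration and is not the paper's implementation: `toy_reward` is a hypothetical word-overlap heuristic standing in for a learned reward model, the "policy" is a softmax over a fixed list of candidate responses rather than a language model, and the update is a plain exact policy-gradient step with a mean-reward baseline.

```python
import math


def toy_reward(context: str, response: str) -> float:
    """Hypothetical stand-in for a learned, context-conditional reward model.

    Returns the fraction of response words that also occur in the context.
    A real reward model would be a trained network; this heuristic only
    shows that the same response can score differently under different contexts.
    """
    context_words = set(context.lower().split())
    response_words = response.lower().split()
    if not response_words:
        return 0.0
    overlap = sum(1 for word in response_words if word in context_words)
    return overlap / len(response_words)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, rewards, lr=1.0):
    """One exact gradient-ascent step on expected reward for a softmax policy.

    For logits z_i and probabilities p_i, d E[r] / d z_i = p_i * (r_i - E[r]).
    """
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    return [z + lr * p * (r - baseline) for z, p, r in zip(logits, probs, rewards)]


def finetune(context, candidates, steps=200, lr=1.0):
    """Toy 'fine-tuning': shift probability mass toward high-reward candidates."""
    rewards = [toy_reward(context, c) for c in candidates]
    logits = [0.0] * len(candidates)  # start from a uniform policy
    for _ in range(steps):
        logits = policy_gradient_step(logits, rewards, lr)
    return softmax(logits), rewards


if __name__ == "__main__":
    context = "any tips for a first marathon"
    candidates = [
        "start slow on your first marathon",
        "buy my product now",
        "tips for a first marathon",
    ]
    probs, rewards = finetune(context, candidates)
    for cand, r, p in zip(candidates, rewards, probs):
        print(f"reward={r:.2f}  prob={p:.3f}  {cand}")
```

In this sketch the policy only ever sees reward values, never the text the reward function was built from, which loosely mirrors the setting where the downstream model has no direct exposure to the social media data. It also shows why any bias in the reward signal carries over to the policy: the update has no other source of information.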

Problem

Research questions and friction points this paper is trying to address.

pragmatics

social signals

reward modeling

language model alignment

factuality

Innovation

Methods, ideas, or system contributions that make the work stand out.

reward model adaptation

context-sensitive conversation

pragmatics-mediated tasks