Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models (LLMs) trained for regression typically optimize only point estimates, so their predictive distributions are often poorly calibrated, which limits applications such as uncertainty quantification and candidate ranking. This work proposes Distribution-Aware Reward, an on-policy reinforcement learning objective that directly optimizes the entire predictive distribution. The method builds an empirical distribution from multiple decoded samples, scores it with the Continuous Ranked Probability Score (CRPS), and assigns each rollout a leave-one-out reward based on its marginal contribution to distribution quality. This rewards predictions that are both accurate and appropriately dispersed, and it mitigates rollout diversity collapse. Experiments on a controlled Gaussian-mixture task, code performance prediction, and molecular property prediction show improvements over supervised fine-tuning and pointwise reinforcement learning baselines. Notably, the method achieves a 6-point gain in Spearman correlation on KBSS and, using only SMILES strings, remains competitive with strong graph-based and 3D molecular models on MoleculeNet.

📝 Abstract

Large language models can predict real-valued quantities from heterogeneous inputs such as text, code, and molecular strings, but most training objectives score each decoded floating-point number independently, improving point estimates without ensuring calibrated predictive distributions. This limits applications requiring candidate ranking or uncertainty estimation. We introduce Distribution-Aware Reward, an on-policy reinforcement learning objective whose main contribution is to train language models to produce better predictive distributions for regression tasks, rather than only optimizing individual decoded outputs against scalar targets. Our method treats multiple decoded samples as an empirical predictive distribution, evaluates it with the Continuous Ranked Probability Score, and assigns leave-one-out credit based on each rollout's marginal contribution to distribution quality, rewarding predictions that are both accurate and appropriately dispersed. We evaluate our method on a controlled Gaussian-mixture task, code performance prediction, and molecular property prediction from SMILES strings. Across tasks, our method improves over supervised fine-tuning and pointwise reinforcement learning baselines, with strong rank-correlation gains, including a 6-point Spearman improvement on KBSS. On MoleculeNet, it uses only SMILES strings yet remains competitive with strong graph-based and 3D molecular models. Further analyses show that our method mitigates rollout diversity collapse and improves uncertainty diagnostics, suggesting that directly optimizing predictive distributions makes language model regression more robust and better calibrated.
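The reward described in the abstract can be illustrated with a minimal sketch. It assumes the standard sample-based CRPS estimator (mean absolute error to the target minus half the mean pairwise absolute difference between samples) and assumes that leave-one-out credit is the change in CRPS when one rollout is removed. The function names `empirical_crps` and `leave_one_out_rewards` are hypothetical, and the paper's exact estimator, normalization, and sign convention may differ.

```python
import numpy as np


def empirical_crps(samples, target):
    """Sample-based CRPS of an empirical distribution (lower is better).

    Assumed estimator: E|X - y| - 0.5 * E|X - X'|, with both expectations
    taken over the decoded samples.
    """
    x = np.asarray(samples, dtype=float)
    # Accuracy term: average distance of the samples to the target.
    accuracy = np.mean(np.abs(x - target))
    # Dispersion term: half the average pairwise distance between samples.
    spread = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return accuracy - spread


def leave_one_out_rewards(samples, target):
    """Per-rollout credit, assumed to be CRPS(without i) - CRPS(all).

    A positive value means that including rollout i lowers (improves)
    the CRPS of the group.
    """
    x = np.asarray(samples, dtype=float)
    if x.size < 2:
        raise ValueError("need at least two rollouts for leave-one-out credit")
    full = empirical_crps(x, target)
    rewards = np.empty(x.size)
    for i in range(x.size):
        rewards[i] = empirical_crps(np.delete(x, i), target) - full
    return rewards


if __name__ == "__main__":
    # Two rollouts on target and one far off: the outlier gets negative credit.
    print(leave_one_out_rewards([2.0, 2.0, 10.0], target=2.0))
```

Under this estimator, two rollouts at 1.0 and 3.0 scored against a target of 2.0 give a CRPS of 0.5, while two collapsed rollouts at 3.0 give 1.0. This is one way to see how a distribution-level score can favor appropriately dispersed predictions over a collapsed set that misses the target.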

Problem

Research questions and friction points this paper is trying to address.

predictive distributions

regression

uncertainty estimation

large language models

calibration

Innovation

Methods, ideas, or system contributions that make the work stand out.

Distribution-Aware Reward

Predictive Distributions

Reinforcement Learning