Preference Robustness for DPO with Applications to Public Health

πŸ“… 2025-09-02
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This study addresses sequential resource allocation in public health, a setting characterized by complex, ambiguous objectives and sparse preference data. We propose DPO-PRO, an algorithm that combines Direct Preference Optimization (DPO) with a lightweight Distributionally Robust Optimization (DRO) formulation, enabling robust reward-function design without costly self-reflection mechanisms. The method fine-tunes large language models on human preferences expressed in natural language, improving robustness to noisy preference signals. Unlike prior DRO-based DPO methods, DPO-PRO is significantly less conservative. Experiments on a real-world maternal mobile-health deployment and on standard alignment benchmarks show performance competitive with self-reflection baselines at substantially lower inference cost, making DPO-PRO a practical approach to value alignment in low-resource settings.

πŸ“ Abstract
We study an LLM fine-tuning task for designing reward functions for sequential resource allocation problems in public health, guided by human preferences expressed in natural language. This setting presents a challenging testbed for alignment due to complex and ambiguous objectives and limited data availability. We propose DPO-PRO, a robust fine-tuning algorithm based on Direct Preference Optimization (DPO), which accounts for uncertainty in the preference distribution using a lightweight Distributionally Robust Optimization (DRO) formulation. Unlike prior DRO-based DPO methods, DPO-PRO is significantly less conservative. We evaluate DPO-PRO on a real-world maternal mobile health program operated by the non-profit organization ARMMAN, as well as on standard alignment benchmarks. Experimental results demonstrate that our method consistently improves robustness to noisy preference signals compared to existing DPO variants. Moreover, DPO-PRO achieves performance comparable to a prior self-reflection-based baseline for reward function design, while requiring significantly lower inference-time cost.
Problem

Research questions and friction points this paper is trying to address.

Designing reward functions for sequential resource allocation
Addressing uncertainty in human preference distributions
Improving robustness with reduced conservatism in DPO
Innovation

Methods, ideas, or system contributions that make the work stand out.

DPO-PRO robust fine-tuning algorithm
Lightweight Distributionally Robust Optimization formulation
Reduces conservatism in preference uncertainty handling
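The DPO-plus-lightweight-DRO idea can be illustrated with a generic sketch. This is not the paper's exact formulation: the standard per-pair DPO loss below is combined with a KL-ball DRO objective in its dual (log-sum-exp) form, which smoothly upweights high-loss preference pairs without the heavy machinery of a full worst-case optimization. The function names, the `beta` and `tau` parameters, and the KL-dual choice are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    # Numerically plain logistic function (fine for the small margins used here).
    return 1.0 / (1.0 + np.exp(-z))

def dpo_losses(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log sigma(beta * (policy margin - reference margin)).

    logp_w / logp_l: policy log-probs of the chosen / rejected responses;
    ref_logp_w / ref_logp_l: the same under the frozen reference model.
    """
    margins = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(sigmoid(margins))

def kl_dro_objective(losses, tau=1.0):
    """Dual form of DRO over a KL ball around the empirical preference
    distribution: tau * log mean(exp(loss / tau)).

    Small tau emphasizes the worst (highest-loss, potentially noisy) pairs;
    as tau grows the objective relaxes back to the plain average loss,
    which is one way to trade robustness against conservatism.
    """
    return tau * np.log(np.mean(np.exp(np.asarray(losses) / tau)))
```

As a sanity check, with a large `tau` the robust objective nearly matches the average DPO loss, while a small `tau` yields a strictly larger (more pessimistic) value on the same pairs.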