Balancing Act: Prioritization Strategies for LLM-Designed Restless Bandit Rewards

📅 2024-08-22
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
In multi-agent settings, large language models (LLMs) struggle to generate reward functions that simultaneously ensure subgroup fairness and decision effectiveness. Method: We propose the LLM-Adjudicator collaborative framework, which integrates a configurable, transparent external adjudicator module grounded in social choice theory and social welfare functions to explicitly model multi-objective trade-offs—thereby overcoming the opacity and bias inherent in end-to-end LLM-based reward design. The framework further incorporates restless bandit modeling for dynamic resource allocation, multi-objective optimization, and prompt engineering to support real-world applications such as public health decision-making. Contribution/Results: Experiments demonstrate that our approach significantly outperforms pure-LLM baselines across three key dimensions: reward effectiveness, alignment with human intent, and cross-subgroup balance—establishing a principled, interpretable alternative to black-box reward generation in fairness-critical multi-agent systems.
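The adjudicator idea above can be sketched concretely: given per-subgroup outcomes induced by each LLM-generated candidate reward function, a user-selected social welfare function scores the candidates and the best one is kept. This is a minimal illustrative sketch, not the paper's implementation; the candidate outcomes and welfare functions below are hypothetical.

```python
# Hypothetical sketch: an external "adjudicator" scores candidate reward
# functions (e.g., produced by an LLM) by the social welfare of the
# per-subgroup utilities they induce, then picks the best candidate.
import math

def utilitarian(utils):   # total welfare across subgroups
    return sum(utils)

def egalitarian(utils):   # welfare of the worst-off subgroup (Rawlsian)
    return min(utils)

def nash(utils):          # product of utilities, via log-sum for stability
    return sum(math.log(u) for u in utils)

def adjudicate(candidate_outcomes, welfare=egalitarian):
    """Pick the candidate whose per-subgroup utilities maximize `welfare`.

    candidate_outcomes: dict mapping candidate name -> list of per-subgroup
    utilities obtained by evaluating the planner under that reward function.
    """
    return max(candidate_outcomes, key=lambda c: welfare(candidate_outcomes[c]))

# Toy example: candidate "B" trades total utility for subgroup balance.
outcomes = {
    "A": [9.0, 1.0],  # effective overall, but leaves subgroup 2 behind
    "B": [5.0, 4.0],  # lower total, more balanced across subgroups
}
print(adjudicate(outcomes, welfare=utilitarian))  # -> A
print(adjudicate(outcomes, welfare=egalitarian))  # -> B
```

Swapping the welfare function changes the selected reward, which is exactly the configurable trade-off control the summary describes.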

📝 Abstract
LLMs are increasingly used to design reward functions based on human preferences in Reinforcement Learning (RL). We focus on LLM-designed rewards for Restless Multi-Armed Bandits, a framework for allocating limited resources among agents. In applications such as public health, this approach empowers grassroots health workers to tailor automated allocation decisions to community needs. In the presence of multiple agents, altering the reward function based on human preferences can affect subpopulations very differently, leading to complex tradeoffs and a multi-objective resource allocation problem. We present the first principled method, termed the Social Choice Language Model, for handling these tradeoffs in LLM-designed rewards for multi-agent planners in general and restless bandits in particular. The novel part of our model is a transparent and configurable selection component, called an adjudicator, external to the LLM, that controls complex tradeoffs via a user-selected social welfare function. Our experiments demonstrate that our model reliably selects more effective, aligned, and balanced reward functions than purely LLM-based approaches.
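To ground the restless-bandit setting the abstract describes, here is a minimal sketch (not the paper's algorithm) of budgeted action selection: each arm represents an agent with a state, a preference-derived reward function scores the value of acting on each arm, and the planner acts on the top-k arms within the budget. The state encoding and reward below are hypothetical.

```python
# Minimal sketch: budgeted arm selection in a restless bandit, where a
# reward function (e.g., one designed by an LLM from human preferences)
# scores each agent's state and the planner acts on the top `budget` arms.
def select_arms(states, reward_fn, budget):
    """Return indices of the `budget` arms with the highest reward_fn(state)."""
    ranked = sorted(range(len(states)),
                    key=lambda i: reward_fn(states[i]),
                    reverse=True)
    return sorted(ranked[:budget])

# Hypothetical agent states: (adherence probability, subgroup id).
states = [(0.2, 0), (0.9, 1), (0.4, 0), (0.1, 1)]

# A preference-derived reward: prioritize low-adherence agents.
reward = lambda s: 1.0 - s[0]

print(select_arms(states, reward, budget=2))  # -> [0, 3]
```

Because different reward functions induce different selections, and thus different outcomes per subgroup, this is where the adjudicator's social welfare function enters: it compares the subgroup-level consequences of each candidate reward before one is deployed.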
Problem

Research questions and friction points this paper is trying to address.

Multi-agent systems
Reward optimization
Resource allocation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Social Choice Language Model
Resource Allocation
Fairness in Reward Systems