What you reward is what you learn: Comparing rewards for online speech policy optimization in public HRI

📅 2026-01-05
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses the challenge of designing effective and socially acceptable dialogue strategies for social robots in open public environments. The authors formulate online speech strategy optimization as a multi-armed bandit problem and employ Thompson sampling to dynamically select among six combinations of speech rate and utterance verbosity during real-world interactions. They systematically compare, for the first time, the efficacy of three distinct user feedback signals in guiding policy learning. Drawing on over 1,400 real human–robot interactions across a 12-day deployment, the research demonstrates that the choice of reward signal significantly influences policy convergence and user behavior. Through offline contextual analysis, the study distills actionable design principles, offering empirical evidence and methodological support for optimizing interactive strategies in public settings.

πŸ“ Abstract
Designing policies that are both efficient and acceptable for conversational service robots in open and diverse environments is non-trivial. Unlike fixed, hand-tuned parameters, online learning can adapt to non-stationary conditions. In this paper, we study how to adapt a social robot's speech policy in the wild. During a 12-day in-situ deployment with over 1,400 public encounters, we cast online policy optimization as a multi-armed bandit problem and use Thompson sampling to select among six actions defined by speech rate (slow/normal/fast) and verbosity (concise/detailed). We compare three complementary binary rewards: Ru (user rating), Rc (conversation closure), and Rt (at least two turns). Each reward induces distinct arm distributions and interaction behaviors. We complement the online results with offline evaluations that analyze contextual factors (e.g., crowd level, group size) using video-annotated data. Taken together, we distill ready-to-use design lessons for deploying online optimization of speech policies in real public HRI settings.
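The setup the abstract describes (Thompson sampling over six speech-policy arms with a binary reward) can be sketched as a standard Bernoulli bandit with Beta posteriors. This is a minimal illustrative sketch, not the authors' implementation: the arm names mirror the paper's rate/verbosity grid, while the simulated success probabilities and the reward loop are invented placeholders standing in for real encounters scored by a reward such as Rt.

```python
import random

random.seed(0)  # reproducible demo

# Six arms: speech rate (slow/normal/fast) x verbosity (concise/detailed),
# as in the paper. Everything below the arm grid is hypothetical.
ARMS = [(rate, verb)
        for rate in ("slow", "normal", "fast")
        for verb in ("concise", "detailed")]

# Beta(1, 1) prior per arm: alpha counts successes, beta counts failures.
alpha = {arm: 1.0 for arm in ARMS}
beta = {arm: 1.0 for arm in ARMS}

def select_arm():
    """Thompson sampling: draw one posterior sample per arm, play the argmax."""
    samples = {arm: random.betavariate(alpha[arm], beta[arm]) for arm in ARMS}
    return max(samples, key=samples.get)

def update(arm, reward):
    """Update the chosen arm's Beta posterior with a binary reward (0/1)."""
    if reward:
        alpha[arm] += 1
    else:
        beta[arm] += 1

# Simulated encounters: assume one arm has a higher true success rate.
true_p = {arm: 0.3 for arm in ARMS}
true_p[("normal", "concise")] = 0.7
for _ in range(2000):
    arm = select_arm()
    update(arm, random.random() < true_p[arm])

# After enough encounters the posterior mean concentrates on the best arm.
best = max(ARMS, key=lambda a: alpha[a] / (alpha[a] + beta[a]))
print(best)
```

In a real deployment the simulated coin flip would be replaced by the observed feedback signal (Ru, Rc, or Rt) after each encounter, which is exactly why the paper finds that the choice of reward shapes which arms the posterior concentrates on.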
Problem

Research questions and friction points this paper is trying to address.

online policy optimization
speech policy
public HRI
multi-armed bandit
reward design
Innovation

Methods, ideas, or system contributions that make the work stand out.

online policy optimization
multi-armed bandit
Thompson sampling
speech policy adaptation
public HRI