🤖 AI Summary
AI assistants struggle to adaptively select a response strategy for ambiguous user queries: answering directly, enumerating possible intents, or asking a clarifying question.
Method: This paper proposes a framework for training controllable clarification strategies: a collaborative dual-agent self-play mechanism generates diverse dialogue data, and the policy is optimized end-to-end via Reinforced Self-Training (ReST) against a cost-penalized accuracy reward. The trained policy adjusts its behavior according to quantified interaction costs (e.g., latency, device constraints) and generalizes to cost values unseen during training.
Contribution/Results: Experiments demonstrate significant improvements in both reward and accuracy across multiple cost settings. The model exhibits strong controllability, adapts across modalities and devices, and markedly improves interaction efficiency on ambiguous requests, all without task-specific fine-tuning or external supervision.
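The reward described above can be sketched as a simple function: full credit for a correct final answer, minus charges for each clarifying question asked and each word generated. The specific cost values and function names here are illustrative assumptions, not taken from the paper; in the paper the costs are provided to the model as inputs so the policy can condition on them.

```python
def cost_penalized_reward(correct, num_clarifications, num_words,
                          cost_per_question=0.2, cost_per_word=0.001):
    """Cost-penalized accuracy (illustrative sketch).

    correct:            whether the assistant's final answer matched the intent
    num_clarifications: how many clarifying questions the assistant asked
    num_words:          how many words the assistant generated in total
    Cost magnitudes are hypothetical placeholders.
    """
    accuracy = 1.0 if correct else 0.0
    return (accuracy
            - cost_per_question * num_clarifications
            - cost_per_word * num_words)
```

Under high question costs (e.g., a voice interface where extra turns are expensive), guessing directly scores better; under low costs, asking first and answering correctly dominates, which is exactly the behavior shift the trained policy is expected to show.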
📝 Abstract
To handle underspecified or ambiguous queries, AI assistants need a policy for managing their uncertainty to determine (a) when to guess the user intent and answer directly, (b) when to enumerate and answer multiple possible intents, and (c) when to ask a clarifying question. However, such policies are contextually dependent on factors such as user preferences or modality. For example, enumerating multiple possible user intentions is cumbersome on small screens or in a voice setting. In this work, we propose to train steerable policies for managing this uncertainty using self-play. Given two agents, one simulating a user and the other an AI assistant, we generate conversations where the user issues a potentially ambiguous query, and the assistant needs to determine how to respond. Importantly, the model takes as input the numerical cost of each clarification question, and each generated word, and is asked to take the action that will maximize its final reward, which is the cost-penalized accuracy. We use Reinforced Self-Training (ReST) to train our model to achieve high reward and show this leads to a steerable policy that changes its behavior predictably conditioned on the provided costs, leading to higher reward and accuracy. Moreover, our procedure also generalizes to numerical cost values that were unobserved at training time.