InfoPO: Information-Driven Policy Optimization for User-Centric Agents

📅 2026-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that large language model agents often fail to elicit critical information from ambiguous user requests, in part because trajectory-level rewards provide insufficient credit assignment across interaction turns. The authors model multi-turn interaction as a process of actively reducing uncertainty and introduce an information-gain reward based on counterfactual masked feedback. Integrated into a GRPO framework and augmented with an adaptive variance-gating strategy, this approach enables fine-grained credit assignment across interaction steps. Empirical results show that the proposed method significantly outperforms prompt engineering and existing reinforcement learning baselines on intent clarification, collaborative programming, and tool-augmented decision-making tasks, exhibiting superior user collaboration, transfer robustness, and environmental generalization.
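The counterfactual idea in the summary can be sketched as follows. Assuming the information gain of a turn is measured as the KL divergence between the agent's next-action distribution given the real user feedback and the same distribution with that feedback masked out (the exact reward form used in the paper may differ; `info_gain_reward` and the toy distributions are illustrative):

```python
import math

def info_gain_reward(p_with_feedback, p_masked):
    """Hypothetical per-turn reward: KL divergence between the agent's
    next-action distribution after seeing the real user feedback and the
    same distribution under a masked-feedback counterfactual. A large
    value means the turn's feedback measurably changed the agent's
    subsequent behavior, so the turn deserves credit."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(p_with_feedback, p_masked)
        if p > 0 and q > 0
    )

# Toy example: feedback that shifts the action distribution earns reward;
# feedback that changes nothing earns zero.
p_real = [0.7, 0.2, 0.1]     # action probs conditioned on real feedback
p_mask = [0.34, 0.33, 0.33]  # action probs with the feedback masked out
r = info_gain_reward(p_real, p_mask)
```

A turn whose feedback leaves the action distribution unchanged gets zero reward, which is what lets the method localize credit to genuinely informative turns.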

📝 Abstract
Real-world user requests to LLM agents are often underspecified. Agents must interact to acquire missing information and make correct downstream decisions. However, current multi-turn GRPO-based methods often rely on trajectory-level reward computation, which leads to credit assignment problems and insufficient advantage signals within rollout groups. A feasible approach is to identify valuable interaction turns at a fine granularity to drive more targeted learning. To address this, we introduce InfoPO (Information-Driven Policy Optimization), which frames multi-turn interaction as a process of active uncertainty reduction and computes an information-gain reward that credits turns whose feedback measurably changes the agent's subsequent action distribution compared to a masked-feedback counterfactual. It then combines this signal with task outcomes via an adaptive variance-gated fusion to identify information importance while maintaining task-oriented goal direction. Across diverse tasks, including intent clarification, collaborative coding, and tool-augmented decision making, InfoPO consistently outperforms prompting and multi-turn RL baselines. It also demonstrates robustness under user simulator shifts and generalizes effectively to environment-interactive tasks. Overall, InfoPO provides a principled and scalable mechanism for optimizing complex agent-user collaboration. Code is available at https://github.com/kfq20/InfoPO.
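The abstract's "adaptive variance-gated fusion" addresses the case where outcome rewards are nearly identical across a rollout group, so GRPO advantages carry little signal. A minimal sketch of that idea, assuming a simple threshold gate (the paper's actual fusion rule and the names `fused_rewards`, `tau` are illustrative, not taken from the source):

```python
import statistics

def fused_rewards(outcome_rewards, info_gain_rewards, tau=0.1):
    """Illustrative variance-gated fusion, not the paper's exact formula:
    when the rollout group's outcome rewards have low variance (so
    group-relative advantages are uninformative), mix in the per-turn
    information-gain rewards; otherwise stay purely task-oriented.
    `tau` is an assumed variance threshold."""
    var = statistics.pvariance(outcome_rewards)
    gate = 1.0 if var < tau else 0.0  # adaptive variance gate
    return [o + gate * g for o, g in zip(outcome_rewards, info_gain_rewards)]

# All rollouts succeed (zero outcome variance): information gain
# differentiates otherwise identical trajectories.
fused = fused_rewards([1.0, 1.0, 1.0], [0.3, 0.0, 0.1])
```

When outcomes already separate the rollouts, the gate closes and the fused reward reduces to the task outcome, preserving the goal-directed signal the abstract emphasizes.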
Problem

Research questions and friction points this paper is trying to address.

underspecified user requests
credit assignment
multi-turn interaction
information gain
policy optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Information-Driven Policy Optimization
Information Gain Reward
Credit Assignment
Multi-turn Interaction
Variance-Gated Fusion
🔎 Similar Papers
2024-09-30 · International Conference on Human-Agent Interaction · Citations: 1