InfoPO: Information-Driven Policy Optimization for User-Centric Agents

📅 2026-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that large language model agents often fail to elicit critical information from ambiguous user requests, in part because trajectory-level rewards provide insufficient credit assignment across interaction turns. The authors model multi-turn interaction as a process of actively reducing uncertainty and introduce an information-gain reward based on counterfactual masked feedback. Integrated into a GRPO framework and augmented with an adaptive variance-gating strategy, this approach enables fine-grained credit assignment across interaction steps. Empirical results show that the proposed method significantly outperforms prompt engineering and existing reinforcement learning baselines on intent clarification, collaborative programming, and tool-augmented decision-making tasks, exhibiting superior user collaboration, transfer robustness, and environmental generalization.
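The counterfactual idea in the summary can be sketched as follows. Assuming the information gain of a turn is measured as the KL divergence between the agent's next-action distribution given the real user feedback and the same distribution with that feedback masked out (the exact reward form used in the paper may differ; `info_gain_reward` and the toy distributions are illustrative):

```python
import math

def info_gain_reward(p_with_feedback, p_masked):
    """Hypothetical per-turn reward: KL divergence between the agent's
    next-action distribution after seeing the real user feedback and the
    same distribution under a masked-feedback counterfactual. A large
    value means the turn's feedback measurably changed the agent's
    subsequent behavior, so the turn deserves credit."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(p_with_feedback, p_masked)
        if p > 0 and q > 0
    )

# Toy example: feedback that shifts the action distribution earns reward;
# feedback that changes nothing earns zero.
p_real = [0.7, 0.2, 0.1]     # action probs conditioned on real feedback
p_mask = [0.34, 0.33, 0.33]  # action probs with the feedback masked out
r = info_gain_reward(p_real, p_mask)
```

A turn whose feedback leaves the action distribution unchanged gets zero reward, which is what lets the method localize credit to genuinely informative turns.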

📝 Abstract
Real-world user requests to LLM agents are often underspecified. Agents must interact to acquire missing information and make correct downstream decisions. However, current multi-turn GRPO-based methods often rely on trajectory-level reward computation, which leads to credit assignment problems and insufficient advantage signals within rollout groups. A feasible approach is to identify valuable interaction turns at a fine granularity to drive more targeted learning. To address this, we introduce InfoPO (Information-Driven Policy Optimization), which frames multi-turn interaction as a process of active uncertainty reduction and computes an information-gain reward that credits turns whose feedback measurably changes the agent's subsequent action distribution compared to a masked-feedback counterfactual. It then combines this signal with task outcomes via an adaptive variance-gated fusion to identify information importance while maintaining task-oriented goal direction. Across diverse tasks, including intent clarification, collaborative coding, and tool-augmented decision making, InfoPO consistently outperforms prompting and multi-turn RL baselines. It also demonstrates robustness under user simulator shifts and generalizes effectively to environment-interactive tasks. Overall, InfoPO provides a principled and scalable mechanism for optimizing complex agent-user collaboration. Code is available at https://github.com/kfq20/InfoPO.
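The abstract's "adaptive variance-gated fusion" addresses the case where outcome rewards are nearly identical across a rollout group, so GRPO advantages carry little signal. A minimal sketch of that idea, assuming a simple threshold gate (the paper's actual fusion rule and the names `fused_rewards`, `tau` are illustrative, not taken from the source):

```python
import statistics

def fused_rewards(outcome_rewards, info_gain_rewards, tau=0.1):
    """Illustrative variance-gated fusion, not the paper's exact formula:
    when the rollout group's outcome rewards have low variance (so
    group-relative advantages are uninformative), mix in the per-turn
    information-gain rewards; otherwise stay purely task-oriented.
    `tau` is an assumed variance threshold."""
    var = statistics.pvariance(outcome_rewards)
    gate = 1.0 if var < tau else 0.0  # adaptive variance gate
    return [o + gate * g for o, g in zip(outcome_rewards, info_gain_rewards)]

# All rollouts succeed (zero outcome variance): information gain
# differentiates otherwise identical trajectories.
fused = fused_rewards([1.0, 1.0, 1.0], [0.3, 0.0, 0.1])
```

When outcomes already separate the rollouts, the gate closes and the fused reward reduces to the task outcome, preserving the goal-directed signal the abstract emphasizes.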
Problem

Research questions and friction points this paper is trying to address.

underspecified user requests
credit assignment
multi-turn interaction
information gain
policy optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Information-Driven Policy Optimization
Information Gain Reward
Credit Assignment
Multi-turn Interaction
Variance-Gated Fusion
🔎 Similar Papers
2024-09-30 · International Conference on Human-Agent Interaction · Citations: 1