Interaction-Grounded Learning for Contextual Markov Decision Processes with Personalized Feedback

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work extends Interaction-Grounded Learning (IGL) from the single-step setting to multi-step contextual Markov Decision Processes (MDPs), targeting realistic scenarios where explicit rewards are unavailable and only indirect feedback, generated by an unknown mechanism, is observable. To address this challenge, the authors develop an implicit reward estimator tailored to sequential decision-making and integrate it with an Inverse Gap Weighting (IGW)-based policy optimization framework, enabling personalized objectives to be decoded from multi-round interactions. Theoretical analysis establishes a sublinear regret bound for the proposed approach, implying that its average per-round regret vanishes as the number of interactions grows. Empirical validation on both synthetic turn-based MDPs and a real-world user booking dataset confirms the method’s effectiveness in learning from implicit, interaction-derived signals without access to ground-truth rewards.
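As a concrete illustration of the IGL setting the summary describes, the sketch below simulates a single interaction in which the environment's latent reward is never revealed to the learner; only a feedback signal produced by an unknown mechanism is. All specifics here (the names `interaction_step` and `theta`, the binary latent reward, the four feedback symbols) are hypothetical choices for illustration, not the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(0)

def interaction_step(context, action, theta):
    """One IGL interaction: the learner observes feedback, never the reward.

    context : feature vector for the current user/state
    action  : index of the chosen action
    theta   : per-action weight vectors defining a hypothetical latent reward
    """
    # Latent (unobserved) binary reward -- hypothetical ground truth.
    latent_reward = float(context @ theta[action] > 0)

    # Unknown feedback mechanism: each reward value is mapped to one of two
    # feedback symbols; the learner is never told which symbols mean "good".
    feedback = rng.choice([2, 3]) if latent_reward else rng.choice([0, 1])

    return feedback  # the learner sees only this signal
```

The learning problem is then to decode which feedback symbols correspond to high latent reward, purely from repeated interactions.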

📝 Abstract
In this paper, we study Interaction-Grounded Learning (IGL) [Xie et al., 2021], a paradigm designed for realistic scenarios where the learner receives indirect feedback generated by an unknown mechanism, rather than explicit numerical rewards. While prior work on IGL provides efficient algorithms with provable guarantees, those results are confined to single-step settings, restricting their applicability to modern sequential decision-making systems such as multi-turn Large Language Model (LLM) deployments. To bridge this gap, we propose a computationally efficient algorithm that achieves a sublinear regret guarantee for contextual episodic Markov Decision Processes (MDPs) with personalized feedback. Technically, we extend the reward-estimator construction of Zhang et al. [2024a] from the single-step to the multi-step setting, addressing the unique challenges of decoding latent rewards under MDPs. Building on this estimator, we design an Inverse-Gap-Weighting (IGW) algorithm for policy optimization. Finally, we demonstrate the effectiveness of our method in learning personalized objectives from multi-turn interactions through experiments on both a synthetic episodic MDP and a real-world user booking dataset.
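Inverse Gap Weighting itself is a standard action-selection rule from the contextual-bandit literature: actions whose estimated reward falls further below that of the empirically best action are played with proportionally smaller probability, while the best action absorbs the remaining mass. A minimal sketch of that rule, assuming only a vector of estimated rewards and a scalar exploration parameter `gamma` (both placeholders, not the paper's estimator or tuning), could look like:

```python
import numpy as np

def igw_distribution(scores, gamma):
    """Inverse Gap Weighting: turn estimated rewards into action probabilities.

    scores : estimated reward for each of K actions
    gamma  : exploration parameter; larger values concentrate on the greedy action
    """
    scores = np.asarray(scores, dtype=float)
    K = len(scores)
    best = int(np.argmax(scores))
    gaps = scores[best] - scores          # nonnegative gap to the greedy action

    # Non-greedy actions: probability inversely proportional to K + gamma * gap.
    p = 1.0 / (K + gamma * gaps)

    # Greedy action: take whatever probability mass remains (always >= 1/K).
    p[best] = 0.0
    p[best] = 1.0 - p.sum()
    return p
```

Each non-greedy probability is at most 1/K, so the leftover mass assigned to the greedy action is always a valid probability; this smooth exploration is what underlies the sublinear-regret analysis of IGW-style algorithms.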
Problem

Research questions and friction points this paper is trying to address.

Interaction-Grounded Learning
Contextual Markov Decision Processes
Personalized Feedback
Sequential Decision-Making
Indirect Feedback
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interaction-Grounded Learning
Contextual MDPs
Personalized Feedback
Inverse-Gap-Weighting
Sublinear Regret