Inverse Contextual Bandits without Rewards: Learning from a Non-Stationary Learner via Suffix Imitation

📅 2026-03-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem of imitating a non-stationary learner whose behavior shifts from exploration to exploitation, in a setting where the observer sees only the learner's action sequence and never its rewards. The authors propose Two-Phase Suffix Imitation, a framework that discards data from an initial burn-in phase and performs empirical risk minimization only on the later imitation-phase actions. They show that a reward-free observer can attain a convergence rate of $\tilde O(1/\sqrt{N})$, matching the asymptotic efficiency of a fully reward-aware learner, so the optimal policy can be recovered from action observations alone.

📝 Abstract

We study the Inverse Contextual Bandit (ICB) problem, in which a learner seeks to optimize a policy while an observer, who cannot access the learner's rewards and only observes actions, aims to recover the underlying problem parameters. During the learning process, the learner's behavior naturally transitions from exploration to exploitation, resulting in non-stationary action data that poses significant challenges for the observer. To address this issue, we propose a simple and effective framework called Two-Phase Suffix Imitation. The framework discards data from an initial burn-in phase and performs empirical risk minimization using only data from a subsequent imitation phase. We derive a predictive decision loss bound that explicitly characterizes the bias-variance trade-off induced by the choice of burn-in length. Despite the severe information deficit, we show that a reward-free observer can achieve a convergence rate of $\tilde O(1/\sqrt{N})$, matching the asymptotic efficiency of a fully reward-aware learner. This result demonstrates that a passive observer can effectively uncover the optimal policy from actions alone, attaining performance comparable to that of the learner itself.
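To make the two-phase idea concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes a finite set of integer-coded contexts and a tabular policy class, for which empirical risk minimization under 0-1 loss reduces to a per-context majority vote over the observed actions. The names `suffix_imitation` and `simulate_learner`, the exploration schedule `min(1, 20/t)`, and all numeric settings are hypothetical choices made for illustration only.

```python
import numpy as np


def suffix_imitation(contexts, actions, burn_in, n_contexts, n_actions):
    """Discard the first `burn_in` rounds, then fit a tabular policy by ERM.

    Assumption: with a tabular policy class and 0-1 loss, the empirical risk
    minimizer is the most frequent observed action for each context.
    """
    ctx = np.asarray(contexts)[burn_in:]
    act = np.asarray(actions)[burn_in:]
    counts = np.zeros((n_contexts, n_actions), dtype=int)
    # Unbuffered accumulation so repeated (context, action) pairs all count.
    np.add.at(counts, (ctx, act), 1)
    return counts.argmax(axis=1)


def simulate_learner(n_rounds, best_action, n_actions, rng):
    """Toy stand-in for a non-stationary learner (hypothetical).

    At round t it plays a uniformly random action with probability
    min(1, 20 / t) and the optimal action otherwise, so its behavior
    drifts from exploration toward exploitation. A real learner would
    estimate the optimal action from rewards; here it is given directly.
    """
    n_contexts = len(best_action)
    contexts = rng.integers(n_contexts, size=n_rounds)
    explore_prob = np.minimum(1.0, 20.0 / np.arange(1, n_rounds + 1))
    explore = rng.random(n_rounds) < explore_prob
    random_actions = rng.integers(n_actions, size=n_rounds)
    actions = np.where(explore, random_actions, best_action[contexts])
    return contexts, actions


rng = np.random.default_rng(0)
best_action = np.array([2, 0, 1, 2])  # optimal action per context (hidden from the observer)
contexts, actions = simulate_learner(5000, best_action, n_actions=3, rng=rng)

# The observer sees only (contexts, actions), never rewards.
policy = suffix_imitation(contexts, actions, burn_in=500, n_contexts=4, n_actions=3)
print(policy)
```

The `burn_in` argument is where the trade-off described in the abstract appears: a longer burn-in removes more exploratory (biased) actions but leaves fewer samples for the imitation phase, which increases variance.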

Problem

Research questions and friction points this paper is trying to address.

Inverse Contextual Bandits

Non-Stationary Learner

Reward-Free Inference

Action-Only Observation

Policy Recovery

Innovation

Methods, ideas, or system contributions that make the work stand out.

Inverse Contextual Bandits

Reward-Free Learning

Non-Stationary Behavior