🤖 AI Summary
This paper studies the Linear Contextual Stochastic Shortest Path (Linear Contextual SSP) problem: in each episode, the learner observes an adversarially chosen context that, via an unknown linear mapping, determines the underlying Markov decision process (MDP); the goal is to reach a target state with minimal cumulative loss, despite unknown transitions, cost functions, and context-to-MDP mapping. To this end, the authors propose LR-CSSP, the first online learning algorithm capable of handling continuous context spaces while guaranteeing finite termination in every episode. LR-CSSP integrates linear function approximation, optimistic value iteration, and a refined exploration mechanism to achieve robust control without prior knowledge. Theoretically, it achieves a regret bound of $\widetilde{O}(K^{2/3} d^{2/3} |S| |A|^{1/3} B_\star^2 T_\star \log(1/\delta))$, and when all costs are bounded below, it attains the optimal $\widetilde{O}(\sqrt{K})$ rate, significantly advancing both the theoretical understanding and practical applicability of contextual SSP.
📝 Abstract
We define the problem of linear Contextual Stochastic Shortest Path (CSSP), where at the beginning of each episode the learner observes an adversarially chosen context that determines the MDP through a fixed but unknown linear function. The learner's objective is to reach a designated goal state with minimal expected cumulative loss, despite having no prior knowledge of the transition dynamics, loss functions, or the mapping from context to MDP. In this work, we propose LR-CSSP, an algorithm that achieves a regret bound of $\widetilde{O}(K^{2/3} d^{2/3} |S| |A|^{1/3} B_\star^2 T_\star \log(1/\delta))$, where $K$ is the number of episodes, $d$ is the context dimension, $S$ and $A$ are the sets of states and actions respectively, $B_\star$ bounds the optimal cumulative loss, and $T_\star$, unknown to the learner, bounds the expected time for the optimal policy to reach the goal. In the case where all costs exceed $\ell_{\min}$, LR-CSSP attains a regret of $\widetilde{O}(\sqrt{K \cdot d^2 |S|^3 |A| B_\star^3 \log(1/\delta)/\ell_{\min}})$. Unlike in contextual finite-horizon MDPs, where limited knowledge primarily leads to higher losses and regret, in the CSSP setting insufficient knowledge can also prolong episodes and may even lead to non-terminating episodes. Our analysis reveals that LR-CSSP effectively handles continuous context spaces, while ensuring all episodes terminate within a reasonable number of time steps.
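To make the problem setup concrete, the following is a minimal, hypothetical sketch (not the paper's code) of how an adversarial context can determine an episode's MDP through a fixed but unknown linear map: here each state-action loss is the inner product of an unknown weight vector with the context, clipped to $[0, 1]$. The names `make_linear_cssp`, `theta`, and the toy two-state instance are illustrative assumptions, not constructs from the paper.

```python
# Hypothetical illustration of the linear contextual SSP setup:
# an unknown weight vector theta[(s, a)] maps a d-dimensional
# context x to the realized loss of state-action pair (s, a).

def make_linear_cssp(theta, transitions):
    """Return a function that, given a context x, yields that
    episode's MDP (losses and transitions).

    theta[(s, a)]: unknown d-dim weight vector for pair (s, a);
    the realized loss is <theta[(s, a)], x>, clipped to [0, 1].
    """
    def episode_mdp(x):
        losses = {
            sa: min(1.0, max(0.0, sum(w * xi for w, xi in zip(wvec, x))))
            for sa, wvec in theta.items()
        }
        # In general the transition kernel may also depend on x;
        # it is held fixed here for simplicity.
        return losses, transitions
    return episode_mdp

# Toy instance: start state 0, goal state "g", one action, context dim 2.
theta = {(0, "go"): [0.5, 0.25]}
transitions = {(0, "go"): {"g": 1.0}}  # "go" reaches the goal w.p. 1
episode_mdp = make_linear_cssp(theta, transitions)
losses, _ = episode_mdp([1.0, 1.0])   # adversary picks context x = (1, 1)
print(losses[(0, "go")])              # 0.75
```

The learner never sees `theta`; it only observes the context at the start of each episode and the losses incurred along its trajectory, which is what forces it to estimate the linear map online.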