Reinforcement Learning with Action-Triggered Observations

📅 2025-10-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Studies reinforcement learning where observations are triggered by actions, proposing the ATST-MDP framework and an action-sequence learning paradigm, and designing the ST-LSVI-UCB algorithm with a proof of its theoretical feasibility.

📝 Abstract
We study reinforcement learning problems where state observations are stochastically triggered by actions, a constraint common in many real-world applications. This framework is formulated as Action-Triggered Sporadically Traceable Markov Decision Processes (ATST-MDPs), where each action has a specified probability of triggering a state observation. We derive tailored Bellman optimality equations for this framework and introduce the action-sequence learning paradigm, in which agents commit to executing a sequence of actions until the next observation arrives. Under the linear MDP assumption, value functions are shown to admit linear representations in an induced action-sequence feature map. Leveraging this structure, we propose off-policy estimators with statistical error guarantees for such feature maps and introduce ST-LSVI-UCB, a variant of LSVI-UCB adapted for action-triggered settings. ST-LSVI-UCB achieves regret $\widetilde{O}(\sqrt{Kd^3(1-\gamma)^{-3}})$, where $K$ is the number of episodes, $d$ the feature dimension, and $\gamma$ the discount factor (per-step episode non-termination probability). Crucially, this work establishes the theoretical foundation for learning with sporadic, action-triggered observations while demonstrating that efficient learning remains feasible under such observation constraints.
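The action-sequence paradigm described in the abstract can be illustrated with a minimal sketch. The helper below (`run_until_observation`, `env_step`, and `trigger_prob` are hypothetical names, not from the paper) executes a committed action sequence against a latent transition function; after each action, an observation arrives with that action's trigger probability, at which point the agent finally sees the new state:

```python
import random

def run_until_observation(env_step, state, action_seq, trigger_prob, rng=random):
    """Execute a committed action sequence until an observation arrives.

    Illustrative sketch of the action-sequence paradigm: the agent commits
    to `action_seq` and only observes the state when some action triggers
    an observation, which happens with probability `trigger_prob[a]`.

    Returns (observed_state, steps_taken); observed_state is None if the
    whole sequence runs out without any observation being triggered.
    """
    for i, action in enumerate(action_seq):
        state = env_step(state, action)        # latent transition (unobserved)
        if rng.random() < trigger_prob[action]:
            return state, i + 1                # observation after i + 1 steps
    return None, len(action_seq)               # sequence exhausted, still blind
```

For example, with `trigger_prob` equal to 1 for every action the agent observes after each step (recovering the standard MDP setting), while probabilities below 1 force it to plan over whole action sequences between observations.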
Problem

Research questions and friction points this paper is trying to address.

Reinforcement learning with stochastic action-triggered state observations
Formulating ATST-MDPs with tailored Bellman optimality equations
Developing efficient algorithms with theoretical guarantees for sporadic observations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Action-triggered observations in MDP formulation
Action-sequence learning with linear representations
ST-LSVI-UCB algorithm with statistical guarantees
Alexander Ryabchenko
University of Toronto and Vector Institute
Wenlong Mou
University of Toronto
machine learning · statistics · optimization · applied probability