🤖 AI Summary
This paper addresses online sequential testing under unknown test outcome distributions: given a stream of individuals, one must dynamically select subsets from a correlated, high-cost pool of candidate tests, balancing information gain against testing cost. Because subjects are only partially tested, missing data arises, rendering the problem strictly harder than standard MDPs and invalidating conventional regret bounds. Theoretically, we show that missingness elevates the minimax regret lower bound to $\Omega(T^{2/3})$. To overcome this, we propose an "explore-then-commit" framework coupled with an iterative elimination algorithm, integrating MDP modeling with maximum-entropy sampling and refining the reward structure to break the natural lower bound. Our approach achieves two distinct regret guarantees: $\tilde{O}(T^{2/3})$ under general conditions and $\tilde{O}(\sqrt{T})$ under favorable structural assumptions. Both theoretical analysis and empirical evaluation validate the efficacy and robustness of the proposed methods.
📝 Abstract
This paper studies an online learning problem that seeks optimal testing policies for a stream of subjects, each of whom can be evaluated through a sequence of candidate tests drawn from a common pool. We refer to this problem as the Online Testing Problem (OTP). Although conducting every candidate test for a subject provides more information, it is often preferable to select only a subset when tests are correlated and costly, and make decisions with partial information. If the joint distribution of test outcomes were known, the problem could be cast as a Markov Decision Process (MDP) and solved exactly. In practice, this distribution is unknown and must be learned online as subjects are tested. When a subject is not fully tested, the resulting missing data can bias estimates, making the problem fundamentally harder than standard episodic MDPs. We prove that the minimax regret must scale at least as $\Omega(T^{\frac{2}{3}})$, in contrast to the $\Theta(\sqrt{T})$ rate in episodic MDPs, revealing the difficulty introduced by missingness. This elevated lower bound is then matched by an Explore-Then-Commit algorithm whose cumulative regret is $\tilde{O}(T^{\frac{2}{3}})$ for both discrete and Gaussian distributions. To highlight the consequence of missingness-dependent rewards in OTP, we study a variant called the Online Cost-sensitive Maximum Entropy Sampling Problem, where rewards are independent of missing data. This structure enables an iterative-elimination algorithm that achieves $\tilde{O}(\sqrt{T})$ regret, breaking the $\Omega(T^{\frac{2}{3}})$ lower bound for OTP. Numerical results confirm our theory in both settings. Overall, this work deepens the understanding of the exploration--exploitation trade-off under missing data and guides the design of efficient sequential testing policies.
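To give intuition for the explore-then-commit template invoked above, here is a minimal sketch in a much simpler multi-armed-bandit setting, not the paper's MDP formulation with missing data. The function `explore_then_commit` and its parameters are illustrative assumptions, not the authors' algorithm: it pulls every arm a fixed number of times, then commits to the empirically best arm for the rest of the horizon.

```python
import random

def explore_then_commit(means, T, n_explore, seed=0):
    """Explore each arm n_explore times, then commit to the empirical best.

    means     -- true mean reward of each arm (unknown to the learner)
    T         -- total horizon
    n_explore -- pulls per arm during the exploration phase
    Returns (committed arm index, cumulative regret over T rounds).
    """
    rng = random.Random(seed)
    K = len(means)
    sums = [0.0] * K
    reward = 0.0
    # Exploration phase: K * n_explore uniform pulls.
    for i in range(K):
        for _ in range(n_explore):
            r = rng.gauss(means[i], 1.0)
            sums[i] += r
            reward += r
    # Commit phase: play the empirically best arm for the remaining rounds.
    best = max(range(K), key=lambda i: sums[i] / n_explore)
    for _ in range(T - K * n_explore):
        reward += rng.gauss(means[best], 1.0)
    regret = T * max(means) - reward
    return best, regret

best, regret = explore_then_commit([0.0, 1.0], T=10_000, n_explore=100)
```

The characteristic rate appears through the tuning of `n_explore`: exploration costs roughly `n_explore` per suboptimal arm, while too little exploration risks committing to the wrong arm, and balancing the two in the worst case gives `n_explore` of order $T^{2/3}$ and hence $O(T^{2/3})$ regret, the same order that the paper shows is unavoidable for OTP in general.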