Probe-then-Commit Multi-Objective Bandits: Theoretical Benefits of Limited Multi-Arm Feedback

📅 2026-02-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses an online multi-objective resource-selection problem in which an agent may probe $q$ candidate resources before execution but can commit to only one. The setting sits between the classical bandit model ($q=1$) and the full-information expert model ($q=K$). The authors propose the PtC-P-UCB algorithm, which chooses its probes by approximately maximizing a hypervolume-inspired frontier-coverage potential and commits to the candidate with the largest marginal hypervolume gain. The paper analyzes multi-objective bandits under limited probing, a regime that existing theory largely skips in favor of the two extremes, and extends the analysis to multi-modal probing. The resulting bounds are a Pareto hypervolume error of $\widetilde{O}(K_P d/\sqrt{qT})$, where $K_P$ is the Pareto-frontier size, and a scalarized regret of $\widetilde{O}(L_\phi d\sqrt{(K/q)T})$. Both shrink by a factor of $1/\sqrt{q}$ as the probe budget grows.
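As a rough illustration of the $1/\sqrt{q}$ effect, the scalarized-regret bound at probe budget $q$ can be compared with the same bound at $q=1$. The numbers $K=16$ and $q=4$ below are hypothetical, and logarithmic factors and constants hidden by $\widetilde{O}$ are ignored:

$$
\frac{L_\phi d\sqrt{(K/q)\,T}}{L_\phi d\sqrt{K\,T}} = \frac{1}{\sqrt{q}},
\qquad
K=16,\ q=4:\quad \sqrt{K/q}=2 \ \text{ versus } \ \sqrt{K}=4 .
$$

Under these assumed values the bound halves. At $q=K$ the factor $K/q$ equals 1, so the stated bound no longer depends on $K$.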

📝 Abstract

We study an online resource-selection problem motivated by multi-radio access selection and mobile edge computing offloading. In each round, an agent chooses among $K$ candidate links/servers (arms) whose performance is a stochastic $d$-dimensional vector (e.g., throughput, latency, energy, reliability). The key interaction is probe-then-commit (PtC): the agent may probe up to $q>1$ candidates via control-plane measurements to observe their vector outcomes, but must execute exactly one candidate in the data plane. This limited multi-arm feedback regime strictly interpolates between classical bandits ($q=1$) and full-information experts ($q=K$), yet existing multi-objective learning theory largely focuses on these extremes. We develop PtC-P-UCB, an optimistic probe-then-commit algorithm whose technical core is frontier-aware probing under uncertainty in a Pareto mode: it selects the $q$ probes by approximately maximizing a hypervolume-inspired frontier-coverage potential and commits by marginal hypervolume gain to directly expand the attained Pareto region. We prove a dominated-hypervolume frontier error of $\tilde{O}(K_P d/\sqrt{qT})$, where $K_P$ is the Pareto-frontier size and $T$ is the horizon, and a scalarized regret of $\tilde{O}(L_\phi d\sqrt{(K/q)T})$, where $\phi$ is the scalarizer. These bounds quantify a transparent $1/\sqrt{q}$ acceleration from limited probing. We further extend to multi-modal probing: each probe returns $M$ modalities (e.g., CSI, queue, compute telemetry), and uncertainty fusion yields variance-adaptive versions of the above bounds via an effective noise scale.
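The "commit by marginal hypervolume gain" step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' PtC-P-UCB implementation: it assumes $d=2$ objectives that are both maximized, a reference point at the origin, and probed vectors that are already available as (optimistic) estimates. The function names `hypervolume_2d` and `commit_by_marginal_gain` are hypothetical. The frontier-coverage potential used to pick the probes is not sketched, since the abstract does not define it.

```python
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]


def hypervolume_2d(points: Sequence[Point], ref: Point = (0.0, 0.0)) -> float:
    """Area dominated by `points` relative to `ref` (both objectives maximized)."""
    # Only points that beat the reference in both coordinates contribute area.
    pts = [(x, y) for x, y in points if x > ref[0] and y > ref[1]]
    # Sweep from the largest x downward; a point adds a strip only if it
    # raises the best y seen so far (otherwise it is dominated).
    pts.sort(key=lambda p: p[0], reverse=True)
    area, best_y = 0.0, ref[1]
    for x, y in pts:
        if y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


def commit_by_marginal_gain(
    attained: Sequence[Point], probed: Dict[int, Point]
) -> Tuple[int, float]:
    """Return the probed arm whose vector adds the most hypervolume (assumed rule)."""
    base = hypervolume_2d(attained)
    gains = {
        arm: hypervolume_2d(list(attained) + [vec]) - base
        for arm, vec in probed.items()
    }
    best_arm = max(gains, key=gains.get)
    return best_arm, gains[best_arm]


# Hypothetical round: two vectors already attained, q = 3 arms probed.
attained = [(3.0, 1.0), (1.0, 3.0)]
probed = {0: (2.0, 2.0), 1: (3.5, 0.5), 2: (1.0, 1.0)}
arm, gain = commit_by_marginal_gain(attained, probed)
print(arm, gain)  # arm 0 fills the gap between the attained points; gain 1.0
```

In this toy round, arm 2 is dominated and adds nothing, arm 1 extends the frontier slightly along the first objective, and arm 0 fills the gap in the middle of the frontier, so it has the largest marginal gain.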

Problem

Research questions and friction points this paper is trying to address.

multi-objective bandits

probe-then-commit

limited feedback

Pareto frontier

online resource selection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Probe-then-Commit

Multi-objective bandits

Pareto frontier