Adapter-Augmented Bandits for Online Multi-Constrained Multi-Modal Inference Scheduling

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of online multimodal large language model inference scheduling, which include modality heterogeneity, dynamically varying inference difficulty, and multidimensional resource constraints. The authors propose M-CMAB, a novel framework that integrates a multi-adapter architecture with a contextual multi-armed bandit formulation under multidimensional knapsack constraints. By freezing the backbone network and employing a CLS attention-based predictor to extract task semantics, M-CMAB leverages lightweight task-specific adapters for efficient representation. A dual constraint handler maintains Lagrange multipliers to enforce long-term resource budgets, while a two-stage budget-aware scheduler balances exploration and exploitation. Theoretical analysis provides regret bounds, and experiments demonstrate that M-CMAB significantly outperforms existing methods on heterogeneous multimodal benchmarks, achieving up to a 14.18% improvement in cumulative reward while closely approaching the ideal upper bound.
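The Predictor component described above (frozen backbone features pooled by CLS attention, with lightweight per-arm adapters) can be sketched roughly as follows. This is an illustrative numpy mock, not the paper's implementation: the random weights, bottleneck width, shared linear head, and all shapes are assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_bottleneck, n_arms = 16, 4, 3

# Frozen backbone output: token embeddings for one request (never updated)
tokens = rng.standard_normal((10, d))
cls_query = rng.standard_normal(d)          # hypothetical CLS query vector

# CLS attention pooling: scaled dot-product scores, softmax, weighted sum
scores = tokens @ cls_query / np.sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
task_repr = weights @ tokens                # compact task representation

# One lightweight adapter per arm: bottleneck MLP with a residual connection;
# only these small matrices would be trained, the backbone stays frozen
adapters = [
    (rng.standard_normal((d, d_bottleneck)) * 0.1,
     rng.standard_normal((d_bottleneck, d)) * 0.1)
    for _ in range(n_arms)
]
head = rng.standard_normal(d) * 0.1         # assumed shared linear reward head

def predict(arm, x):
    down, up = adapters[arm]
    h = x + np.maximum(x @ down, 0) @ up    # down-project, ReLU, up-project, residual
    return float(h @ head)                  # action-specific reward estimate

estimates = [predict(a, task_repr) for a in range(n_arms)]
```

The point of the bottleneck shape is the one the summary emphasizes: each arm's trainable state is two small matrices (here 16×4 and 4×16) rather than a full model, so per-action estimators stay cheap to update online.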

📝 Abstract
Multi-modal large language model (MLLM) inference scheduling enables strong response quality under practical and heterogeneous budgets, beyond what a homogeneous single-backend setting can offer. Yet online MLLM task scheduling is nontrivial, as requests vary sharply in modality composition and latent reasoning difficulty, while execution backends incur distinct, time-varying costs due to system jitter and network variation. These coupled uncertainties pose two core challenges: deriving semantically faithful yet scheduling-relevant multi-modal task representations, and making low-overhead online decisions over irreversible multi-dimensional budgets. Accordingly, we propose M-CMAB (Multi-modal Multi-constraint Contextual Multi-Armed Bandit), a multi-adapter-enhanced MLLM inference scheduling framework with three components: (i) a CLS-attentive, frozen-backbone Predictor that extracts compact task representations and updates only lightweight adapters for action-specific estimation; (ii) a primal-dual Constrainer that maintains online Lagrange multipliers to enforce long-horizon constraints via per-round objectives; and (iii) a two-phase Scheduler that balances exploration and exploitation under irreversible budgets. We establish a regret guarantee under multi-dimensional knapsack constraints. On a composite multimodal benchmark with heterogeneous backends, M-CMAB consistently outperforms state-of-the-art baselines across budget regimes, achieving up to 14.18% higher reward and closely tracking an oracle-aided upper bound. Code is available at https://anonymous.4open.science/r/M2CMAB/.
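The primal-dual loop the abstract describes — arm selection on a Lagrangian-adjusted score, plus an online multiplier update that prices each resource against its per-round budget — can be sketched roughly as follows. Everything here is an illustrative assumption rather than the authors' algorithm: the fixed per-arm costs, the linear reward model, the UCB-style bonus, and the step size `eta` are all hypothetical stand-ins for the paper's learned components.

```python
import numpy as np

rng = np.random.default_rng(0)

n_arms, n_res, horizon = 3, 2, 500
budget = np.full(n_res, 0.5 * horizon)          # long-horizon resource budgets
cost = rng.uniform(0.2, 0.8, (n_arms, n_res))   # assumed fixed per-arm costs
true_reward = np.array([0.3, 0.6, 0.5])         # assumed mean rewards

lam = np.zeros(n_res)       # Lagrange multipliers (dual resource prices)
eta = 0.05                  # dual step size
remaining = budget.copy()
est = np.zeros(n_arms)      # empirical per-arm reward estimates
pulls = np.zeros(n_arms)
total_reward = 0.0

for t in range(horizon):
    # budgets are irreversible: stop once some resource cannot cover a pull
    if np.any(remaining < cost.max(axis=0)):
        break
    # per-round objective: optimistic reward minus the priced resource cost
    bonus = np.sqrt(2 * np.log(t + 1) / np.maximum(pulls, 1))
    score = est + bonus - cost @ lam
    a = int(np.argmax(score))
    r = true_reward[a] + 0.1 * rng.standard_normal()
    pulls[a] += 1
    est[a] += (r - est[a]) / pulls[a]           # running-mean update
    remaining -= cost[a]
    total_reward += r
    # dual update: raise the price of any resource consumed faster than
    # its average per-round allowance, projected back onto lam >= 0
    lam = np.maximum(0.0, lam + eta * (cost[a] - budget / horizon))

print(f"cumulative reward: {total_reward:.2f}")
```

The design choice worth noting is that the constraints never enter the per-round decision directly: they are folded into the score through the multipliers, so the scheduler keeps making cheap unconstrained argmax decisions while the dual update steers long-run consumption back toward the budget.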
Problem

Research questions and friction points this paper is trying to address.

multi-modal inference
online scheduling
multi-constrained optimization
resource budget
task representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal LLM
Adapter-based representation
Multi-constrained bandits
Online inference scheduling
Primal-dual optimization
Xianzhi Zhang
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510006, China
Yue Xu
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510006, China
Yinlin Zhu
Sun Yat-sen University
Graph Neural Networks, Federated Learning
Di Wu
Professor of Computer Science, Sun Yat-sen University
networking, multimedia communication, distributed computing
Yipeng Zhou
School of Computing, Macquarie University, NSW 2109, Australia
Miao Hu
Sun Yat-sen University
edge computing, machine learning, AR/VR, 4K/8K
Guocong Quan
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510006, China