POLAR: Online Learning for LoRA Adapter Caching and Routing in Edge LLM Serving

📅 2026-04-17
📈 Citations: 0
Influential: 0

🤖 AI Summary
Edge devices have limited GPU memory and DRAM, so only a small number of LoRA adapters can stay resident at once; serving a request through a non-resident adapter means paging its weights in from storage, which adds measurable latency. This work formulates joint adapter caching and request routing as a two-timescale contextual bandit problem (presented as the first such formulation) and proposes POLAR, which pairs a cache-aware LinUCB router with an epoch-based cache controller. A fixed-epoch variant gives worst-case regret guarantees under arbitrary contexts. An epoch-doubling variant, POLAR+, adds forced exploration and improved cache optimization and attains $\widetilde{\mathcal{O}}(d\sqrt{NT}+\sqrt{KT})$ regret under stochastic regularity and cacheability conditions. Experiments with 15 real LoRA adapters for Qwen2.5-7B and measured GPU paging latencies show that adaptive cache control substantially outperforms non-adaptive baselines, with scaling trends consistent with the theory.
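
To make the slow-timescale side concrete, here is a minimal sketch of an epoch-based cache controller. It is not the paper's algorithm: the function names `epoch_boundaries` and `select_resident`, the top-K selection rule, and the per-adapter `utility` estimates are all assumptions made for illustration. It only shows the two ingredients the summary mentions, a fixed or doubling epoch schedule and a cache refresh at each epoch start.

```python
def epoch_boundaries(horizon, first_len=1, doubling=True):
    """Return the request indices at which the cache would be refreshed.

    Hypothetical helper: with doubling=True each epoch is twice as long as
    the previous one; with doubling=False every epoch has length first_len.
    """
    starts, t, length = [], 0, first_len
    while t < horizon:
        starts.append(t)
        t += length
        if doubling:
            length *= 2
    return starts


def select_resident(utility, cache_size):
    """Keep the cache_size adapters with the highest estimated utility.

    Assumed stand-in for the cache optimization step; the actual objective
    used by POLAR is not given in this text.
    """
    ranked = sorted(range(len(utility)), key=lambda a: utility[a], reverse=True)
    return set(ranked[:cache_size])


if __name__ == "__main__":
    print(epoch_boundaries(15))                       # doubling schedule
    print(epoch_boundaries(10, first_len=4, doubling=False))
    print(select_resident([0.2, 0.9, 0.5, 0.7], cache_size=2))
```

With a doubling schedule the number of cache refreshes grows only logarithmically in the horizon, which is one plausible reason for an epoch-doubling design when each refresh involves paging adapter weights.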

📝 Abstract
Edge deployment of large language models (LLMs) increasingly relies on libraries of lightweight LoRA adapters, yet GPU/DRAM can keep only a small resident subset at a time. Serving a request through a non-resident adapter requires paging its weights from storage, incurring measurable latency. This creates a two-timescale online control problem: on a slow timescale, the system selects which adapters remain resident in fast memory, while on a fast timescale it routes each request to an adapter whose context-dependent utility is unknown a priori. The two decisions are tightly coupled: the cache determines the cost of exploration, and the router determines which adapters receive informative feedback. We formulate this joint caching-and-routing problem as a two-timescale contextual bandit and propose POLAR (Paging and Online Learning for Adapter Routing). POLAR pairs a cache-aware LinUCB router with an epoch-based cache controller. We study two variants. A fixed-epoch version provides a robust baseline with worst-case regret guarantees under arbitrary contexts. An epoch-doubling version, POLAR+, adds forced exploration and improved cache optimization to achieve $\widetilde{\mathcal{O}}(d\sqrt{NT}+\sqrt{KT})$ sublinear regret under stochastic regularity and cacheability conditions, where $N$ is the adapter count, $K$ the cache size, $d$ the context dimension, and $T$ the horizon. The routing term matches the standard contextual-bandit rate up to logarithmic factors, showing that the memory hierarchy does not fundamentally slow routing learning. Experiments using 15 real LoRA adapters for Qwen2.5-7B together with measured GPU paging latencies show that adaptive cache control substantially outperforms non-adaptive baselines and exhibits scaling trends consistent with the theory.
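
The fast-timescale routing step can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes a disjoint (per-adapter) LinUCB model and a single scalar `miss_cost` subtracted from the score of every non-resident adapter. The class name `CacheAwareLinUCB` and the parameters `alpha`, `ridge`, and `miss_cost` are hypothetical.

```python
import numpy as np


class CacheAwareLinUCB:
    """Per-adapter LinUCB with an assumed paging penalty for cache misses."""

    def __init__(self, n_adapters, dim, alpha=1.0, ridge=1.0):
        self.alpha = alpha
        # Ridge-regression statistics per adapter:
        # A = ridge * I + sum of x x^T, b = sum of reward * x.
        self.A = np.stack([ridge * np.eye(dim) for _ in range(n_adapters)])
        self.b = np.zeros((n_adapters, dim))

    def ucb(self, x):
        """Upper confidence bound on each adapter's utility for context x."""
        scores = np.empty(len(self.A))
        for a in range(len(self.A)):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]
            scores[a] = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
        return scores

    def route(self, x, resident, miss_cost):
        """Pick the adapter with the best UCB after charging the paging cost.

        resident is the set of adapter indices currently in fast memory;
        miss_cost is an assumed latency penalty in reward units.
        """
        scores = self.ucb(x)
        penalty = np.array(
            [0.0 if a in resident else miss_cost for a in range(len(scores))]
        )
        return int(np.argmax(scores - penalty))

    def update(self, a, x, reward):
        """Fold the observed reward for adapter a into its statistics."""
        self.A[a] += np.outer(x, x)
        self.b[a] += reward * x
```

In this sketch the coupling described in the abstract shows up directly: a larger `miss_cost` makes exploring non-resident adapters more expensive, so the cache contents shape which adapters the router is willing to try, and only the adapters it tries receive feedback through `update`.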
Problem

Research questions and friction points this paper is trying to address.

LoRA adapter caching
online learning
edge LLM serving
contextual bandit
memory hierarchy
Innovation

Methods, ideas, or system contributions that make the work stand out.

LoRA adapter caching
online learning
contextual bandits
edge LLM serving
two-timescale optimization