🤖 AI Summary
Deploying large language models on edge devices is constrained by limited GPU/DRAM capacity, so only a small subset of LoRA adapters can stay resident in fast memory; serving a request through a non-resident adapter requires paging its weights from storage, which adds latency. This work formulates joint caching and request routing as a two-timescale contextual bandit problem, which it presents as the first such formulation, and proposes POLAR, a framework that pairs a cache-aware LinUCB router with an epoch-based cache controller. An epoch-doubling variant, POLAR+, adds forced exploration and improved cache optimization. Theoretical analysis gives a sublinear regret bound of $\widetilde{\mathcal{O}}(d\sqrt{NT}+\sqrt{KT})$ for POLAR+ under stochastic regularity and cacheability conditions, and experiments with 15 real LoRA adapters for Qwen2.5-7B show that adaptive cache control substantially outperforms non-adaptive baselines, with empirical scaling trends consistent with the theory.
📝 Abstract
Edge deployment of large language models (LLMs) increasingly relies on libraries of lightweight LoRA adapters, yet GPU/DRAM can keep only a small resident subset at a time. Serving a request through a non-resident adapter requires paging its weights from storage, incurring measurable latency. This creates a two-timescale online control problem: on a slow timescale, the system selects which adapters remain resident in fast memory, while on a fast timescale it routes each request to an adapter whose context-dependent utility is unknown a priori. The two decisions are tightly coupled: the cache determines the cost of exploration, and the router determines which adapters receive informative feedback. We formulate this joint caching-and-routing problem as a two-timescale contextual bandit and propose POLAR (Paging and Online Learning for Adapter Routing). POLAR pairs a cache-aware LinUCB router with an epoch-based cache controller. We study two variants. A fixed-epoch version provides a robust baseline with worst-case regret guarantees under arbitrary contexts. An epoch-doubling version, POLAR+, adds forced exploration and improved cache optimization to achieve $\widetilde{\mathcal{O}}(d\sqrt{NT}+\sqrt{KT})$ sublinear regret under stochastic regularity and cacheability conditions, where $N$ is the adapter count, $K$ the cache size, $d$ the context dimension, and $T$ the horizon. The routing term matches the standard contextual-bandit rate up to logarithmic factors, showing that the memory hierarchy does not fundamentally slow routing learning. Experiments using 15 real LoRA adapters for Qwen2.5-7B together with measured GPU paging latencies show that adaptive cache control substantially outperforms non-adaptive baselines and exhibits scaling trends consistent with the theory.
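The two-timescale structure described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the class name `CacheAwareLinUCB`, the `simulate` helper, the choice to subtract a fixed `miss_cost` from the LinUCB index of non-resident adapters, and the "keep the top-`k` most-routed adapters" cache refresh are all assumptions made for illustration. POLAR's actual cache controller, its forced exploration, and the epoch-doubling schedule of POLAR+ are not reproduced here.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


class CacheAwareLinUCB:
    """Per-adapter LinUCB whose index is penalized for non-resident adapters.

    Hypothetical illustration; the penalty form is an assumption, not POLAR's rule.
    """

    def __init__(self, n_adapters, dim, alpha=1.0, miss_cost=0.2):
        self.n = n_adapters
        self.alpha = alpha          # exploration width multiplier
        self.miss_cost = miss_cost  # assumed paging penalty, in reward units
        # Inverse ridge design matrix per adapter, initialized to the identity.
        self.A_inv = [[[float(i == j) for j in range(dim)] for i in range(dim)]
                      for _ in range(n_adapters)]
        self.b = [[0.0] * dim for _ in range(n_adapters)]

    def ucb(self, a, x):
        Ax = [dot(row, x) for row in self.A_inv[a]]             # A^{-1} x
        theta = [dot(row, self.b[a]) for row in self.A_inv[a]]  # ridge estimate
        return dot(theta, x) + self.alpha * math.sqrt(max(dot(x, Ax), 0.0))

    def route(self, x, cache):
        # Fast timescale: pick the adapter with the best penalized index.
        def score(a):
            return self.ucb(a, x) - (0.0 if a in cache else self.miss_cost)
        return max(range(self.n), key=score)

    def update(self, a, x, utility):
        # Sherman-Morrison rank-one update of A^{-1} after observing (x, utility).
        A_inv = self.A_inv[a]
        Ax = [dot(row, x) for row in A_inv]
        denom = 1.0 + dot(x, Ax)
        for i in range(len(x)):
            for j in range(len(x)):
                A_inv[i][j] -= Ax[i] * Ax[j] / denom
        self.b[a] = [bi + utility * xi for bi, xi in zip(self.b[a], x)]


def simulate(T=400, n=6, d=3, k=2, epoch=50, seed=0):
    """Run a synthetic stream with linear utilities; returns (net reward, final cache)."""
    rng = random.Random(seed)
    true_theta = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
    router = CacheAwareLinUCB(n, d)
    cache = set(range(k))
    counts = [0] * n
    net = 0.0
    for t in range(T):
        if t > 0 and t % epoch == 0:
            # Slow timescale: keep the k adapters routed to most in the last epoch
            # (a stand-in heuristic, not the paper's cache controller).
            ranked = sorted(range(n), key=lambda a: counts[a], reverse=True)
            cache = set(ranked[:k])
            counts = [0] * n
        x = [rng.uniform(-1, 1) for _ in range(d)]
        a = router.route(x, cache)
        utility = dot(true_theta[a], x) + rng.gauss(0.0, 0.1)
        router.update(a, x, utility)
        counts[a] += 1
        net += utility - (0.0 if a in cache else router.miss_cost)
    return net, cache
```

Calling `simulate()` runs 400 synthetic requests with fixed 50-step epochs. The sketch makes the coupling visible: the cache changes which adapters are cheap to explore through the `miss_cost` penalty, and the router's choices determine the counts that drive the next cache refresh.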