🤖 AI Summary
On integrated NPUs, multi-tenant DNN workloads sharing on-chip caches suffer from unpredictable cache conflicts and low utilization. Method: This paper proposes a hardware–software co-design approach: (1) a novel cache partitioning mechanism enabling both model-exclusive and NPU-controllable shared caching; and (2) a joint scheduling algorithm combining capacity-aware static mapping with runtime dynamic quota adjustment. Contributions/Results: The hardware implementation is lightweight and scalable; the software scheduler ensures fairness and efficiency. Experiments show an average 33.4% reduction in memory accesses, up to 2.56× single-model speedup, and 1.88× average speedup across workloads—significantly improving cache efficiency and system throughput in multi-tenant scenarios.
📝 Abstract
With the rapid development of DNN applications, multi-tenant execution, where multiple DNNs are co-located on a single SoC, is becoming a prevailing trend. Although prior works have proposed many methods to improve multi-tenant performance, the impact of the shared cache is not well studied. This paper proposes CaMDN, an architecture–scheduling co-design to enhance cache efficiency for multi-tenant DNNs on integrated NPUs. Specifically, a lightweight architecture is proposed to support model-exclusive, NPU-controlled regions inside the shared cache to eliminate unexpected cache contention. Moreover, a cache scheduling method is proposed to improve shared cache utilization. In particular, it includes a cache-aware mapping method that adapts to the varying available cache capacity and a dynamic allocation algorithm that adjusts cache usage among co-located DNNs at runtime. Compared to prior works, CaMDN reduces memory accesses by 33.4% on average and achieves a model speedup of up to 2.56× (1.88× on average).
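To make the runtime-adjustment idea concrete, here is a minimal, hypothetical sketch of dynamic cache-quota reallocation among co-located models. The way counts, model names, miss-rate inputs, and the one-way-per-epoch heuristic are all illustrative assumptions, not CaMDN's actual algorithm.

```python
def adjust_quotas(quotas, miss_rates, min_ways=1):
    """Hypothetical heuristic: each epoch, shift one shared-cache way
    from the model with the lowest miss rate to the model with the
    highest, keeping the total number of ways fixed.

    quotas: dict model -> cache ways currently assigned (assumed unit)
    miss_rates: dict model -> measured miss rate for the last epoch
    """
    models = list(quotas)
    hungriest = max(models, key=lambda m: miss_rates[m])  # most misses
    richest = min(models, key=lambda m: miss_rates[m])    # fewest misses
    # Only move a way if the donor stays above its minimum allocation.
    if hungriest != richest and quotas[richest] > min_ways:
        quotas[richest] -= 1
        quotas[hungriest] += 1
    return quotas

# Illustrative use: two co-located models sharing 16 ways (mock numbers).
quotas = {"resnet50": 8, "bert": 8}
miss_rates = {"resnet50": 0.31, "bert": 0.07}
print(adjust_quotas(quotas, miss_rates))
# → {'resnet50': 9, 'bert': 7}
```

The key property such a scheduler must preserve is that the total allocation never exceeds the physical cache, which the single-way swap guarantees by construction.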