VisMMoE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading

📅 2026-05-07
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the inefficiency of deploying vision-language mixture-of-experts (VL-MoE) models on memory-constrained devices, where dense visual inputs lead to widespread and unpredictable expert activation. To tackle this, we introduce the concept of “visual-expert affinity,” demonstrating that pruning visual tokens not only reduces computational cost but also concentrates and stabilizes expert access patterns. Building on this insight, we propose an affinity-aware token compression strategy, a lookahead expert prediction mechanism, and a coordinated caching and pipelining schedule, which together improve expert locality and prefetching efficiency. Evaluated across multiple VL-MoE architectures, our approach achieves up to 2.68× faster end-to-end inference on mobile devices and up to 1.61× on edge devices over strong baselines, while preserving model accuracy.
📝 Abstract
Large-scale vision-language mixture-of-experts (VL-MoE) models provide strong multimodal capability, but efficient deployment on memory-constrained platforms remains difficult. Existing MoE offloading systems are largely designed for text-centric workloads and become much less effective for visual-heavy inputs, where large numbers of visual tokens induce broader and less predictable expert accesses. We present VisMMoE, a VL-MoE offloading system built on a single systems insight: pruning redundant visual tokens can improve offloading not only by reducing computation, but also by reshaping expert demand. We refer to this effect as visual-expert affinity: token pruning makes expert accesses more concentrated within layers and more stable across layers, producing a smaller and more predictable expert working set. Guided by this insight, VisMMoE combines affinity-aware token compression, lookahead expert prediction, and cache/pipeline orchestration to improve expert locality and prefetch effectiveness under tight memory budgets. We implement VisMMoE on multiple frameworks and evaluate it on representative VL-MoE models and benchmarks. VisMMoE improves end-to-end inference performance by up to 2.68× and 1.61×, respectively, over strong baselines for today's VL-MoE deployments while maintaining competitive accuracy.
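The notion of an "expert working set" in the abstract can be made concrete with a toy sketch. The Python below is not VisMMoE's implementation: the router logits are random, the saliency `scores` are a hypothetical stand-in for whatever signal the paper's affinity-aware compression uses, and the sizes (64 experts, top-2 routing, 576 visual tokens) are illustrative assumptions. It only shows how the set of experts touched in one layer can be measured before and after token pruning.

```python
import random


def top_k_experts(logits, k):
    """Return the indices of the k largest router logits for one token."""
    ranked = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)
    return ranked[:k]


def expert_working_set(token_logits, k):
    """Experts activated by at least one token in a single MoE layer."""
    used = set()
    for logits in token_logits:
        used.update(top_k_experts(logits, k))
    return used


def prune_tokens(token_logits, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving their original order.

    `scores` is a hypothetical per-token saliency signal (an assumption).
    """
    ranked = sorted(range(len(token_logits)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [token_logits[i] for i in kept]


# Illustrative sizes only; none of these values come from the paper.
rng = random.Random(0)
num_experts, top_k, num_tokens = 64, 2, 576
token_logits = [
    [rng.gauss(0.0, 1.0) for _ in range(num_experts)] for _ in range(num_tokens)
]
scores = [rng.random() for _ in range(num_tokens)]

full = expert_working_set(token_logits, top_k)
pruned = expert_working_set(prune_tokens(token_logits, scores, keep=32), top_k)
print(len(full), len(pruned))
```

Because the pruned tokens are a subset of the original tokens, the pruned working set is always a subset of the full one. With random logits the reduction comes only from having fewer tokens; the abstract's stronger claim, that accesses also become more concentrated within layers and more stable across layers, depends on real router behavior and is not demonstrated by this toy.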
Problem

Research questions and friction points this paper is trying to address.

vision-language MoE
model offloading
memory-constrained deployment
visual tokens
expert access unpredictability
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual-expert affinity
MoE offloading
token pruning
expert locality
vision-language models