Native LLM and MLLM Inference at Scale on Apple Silicon

📅 2026-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of efficient native inference support for large language models (LLMs) and multimodal large language models (MLLMs) on Apple Silicon. We present vllm-mlx, a framework built on MLX that, for the first time, unifies high-performance LLM and MLLM inference on Apple Silicon. Its key innovations are a content-aware visual prefix caching mechanism, which uses content hashing to eliminate redundant image encoding, together with continuous batching and optimizations for the unified memory architecture. Experimental results demonstrate significant performance gains: text-only throughput improves by 21%–87%, aggregate throughput scales to 4.3× at 16 concurrent requests, repeated image queries achieve a 28× speedup, and multi-frame video analysis accelerates by 24.7×. The project is released as open source.
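
To make the content-aware caching idea concrete, here is a minimal sketch in Python. It is not the paper's implementation: the `CachedVisionEncoder` wrapper and the encoder callable it wraps are hypothetical stand-ins, and the FIFO eviction policy is chosen purely for brevity. The essential point is that encoded vision features are keyed by a hash of the decoded pixel content, so a repeated image hits the cache no matter how it was submitted (file path, raw bytes, or base64 payload).

```python
# Illustrative sketch of content-based visual prefix caching; names and the
# eviction policy are assumptions, not vllm-mlx internals.
import hashlib
from typing import Callable, Dict

import numpy as np


class CachedVisionEncoder:
    """Memoizes vision-encoder outputs keyed by image content hash."""

    def __init__(self, encoder: Callable[[np.ndarray], np.ndarray], capacity: int = 128):
        self.encoder = encoder                      # expensive vision tower forward pass
        self.capacity = capacity
        self._cache: Dict[str, np.ndarray] = {}    # content hash -> encoded features

    @staticmethod
    def _content_key(pixels: np.ndarray) -> str:
        # Hash the decoded pixel buffer, not the input file bytes, so the same
        # image matches regardless of container format (PNG, JPEG, base64, ...).
        return hashlib.sha256(np.ascontiguousarray(pixels).tobytes()).hexdigest()

    def encode(self, pixels: np.ndarray) -> np.ndarray:
        key = self._content_key(pixels)
        if key in self._cache:
            return self._cache[key]                 # cache hit: skip re-encoding
        features = self.encoder(pixels)             # cache miss: run the encoder once
        if len(self._cache) >= self.capacity:
            self._cache.pop(next(iter(self._cache)))  # simple FIFO eviction for brevity
        self._cache[key] = features
        return features
```

Hashing decoded pixels rather than the original file bytes is what makes the cache format-agnostic, which is how repeated image queries and multi-frame video analysis can skip most of the vision encoding work.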

📝 Abstract
The growing adoption of Apple Silicon for machine learning development has created demand for efficient inference solutions that leverage its unique unified memory architecture. However, existing tools either lack native optimization (PyTorch MPS) or focus solely on text models, leaving multimodal workloads underserved. We present vllm-mlx, a framework for efficient LLM and MLLM inference on Apple Silicon built natively on MLX. For text models, we achieve 21% to 87% higher throughput than llama-cpp across models ranging from Qwen3-0.6B to Nemotron-30B, while providing continuous batching that scales to 4.3× aggregate throughput at 16 concurrent requests. For multimodal models, we introduce content-based prefix caching that eliminates redundant vision encoding by identifying identical images through content hashing, regardless of input format. Our evaluation on Apple M4 Max demonstrates throughput of up to 525 tokens per second on text models and 28× speedup on repeated image queries, reducing multimodal latency from 21.7 seconds to under 1 second. Video analysis with up to 64 frames achieves 24.7× cache speedup. We release our implementation as open source to support efficient inference on consumer Apple hardware.
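
The continuous batching behaviour described in the abstract can be sketched as a simple scheduling loop. This is illustrative only, under assumed interfaces: `Request` and `model.decode_step` are hypothetical placeholders rather than vllm-mlx APIs. The key property is that new requests are admitted into the running batch as soon as a slot frees up, instead of waiting for the whole batch to finish.

```python
# Minimal continuous-batching loop (illustrative; not the vllm-mlx scheduler).
from collections import deque
from dataclasses import dataclass, field
from typing import List


@dataclass
class Request:
    prompt_tokens: List[int]
    max_new_tokens: int
    generated: List[int] = field(default_factory=list)


def serve(model, waiting: "deque[Request]", max_batch: int = 16) -> None:
    running: List[Request] = []
    while waiting or running:
        # Admit pending requests whenever slots are free, rather than only
        # when the whole batch has drained (static batching).
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())

        # One decode step advances every active sequence by a single token.
        next_tokens = model.decode_step(running)    # hypothetical batched-decode API
        for req, tok in zip(running, next_tokens):
            req.generated.append(tok)

        # Retire finished sequences immediately; their slots are reused on the
        # next iteration, which is what lifts aggregate throughput under load.
        running = [r for r in running if len(r.generated) < r.max_new_tokens]
```

For simplicity the sketch only stops a sequence at its token budget; a real scheduler would also handle end-of-sequence tokens and KV-cache memory pressure.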
Problem

Research questions and friction points this paper is trying to address.

Apple Silicon
LLM inference
MLLM inference
multimodal workloads
unified memory architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

Apple Silicon
multimodal LLM
content-based prefix caching
unified memory architecture
continuous batching