MOM: Memory-Efficient Offloaded Mini-Sequence Inference for Long Context Language Models

📅 2025-04-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long-context LLM inference faces severe GPU memory bottlenecks, historically dominated by the prefill phase, which hinder practical deployment. This work proposes MOM, a memory-efficient inference framework based on mini-sequence partitioning of critical layers and hierarchical KV cache offloading. By drastically reducing prefill memory consumption, MOM eliminates prefill as the dominant inference bottleneck and shifts the remaining optimization target to decode-stage residual KV cache efficiency. The approach comprises three key components: lightweight last-layer computation scheduling, memory-aware inference scheduling, and a coordinated offloading mechanism. Experiments demonstrate an average reduction of over 50% in peak GPU memory usage. On a single A100 80GB GPU, Meta-Llama-3.2-8B supports contexts up to 455k tokens, up from a 155k-token baseline (a 194% increase), and MOM achieves a 35% greater context-length extension than conventional chunked-prefill methods, all while keeping outputs identical and preserving accuracy and competitive throughput.

📝 Abstract
Long-context language models exhibit impressive performance but remain challenging to deploy due to high GPU memory demands during inference. We propose Memory-efficient Offloaded Mini-sequence Inference (MOM), a method that partitions critical layers into smaller "mini-sequences" and integrates seamlessly with KV cache offloading. Experiments on various Llama, Qwen, and Mistral models demonstrate that MOM reduces peak memory usage by over 50% on average. On Meta-Llama-3.2-8B, MOM extends the maximum context length from 155k to 455k tokens on a single A100 80GB GPU, while keeping outputs identical and not compromising accuracy. MOM also maintains highly competitive throughput due to minimal computational overhead and efficient last-layer processing. Compared to traditional chunked prefill methods, MOM achieves a 35% greater context length extension. More importantly, our method drastically reduces prefill memory consumption, eliminating it as the longstanding dominant memory bottleneck during inference. This breakthrough fundamentally changes research priorities, redirecting future efforts from prefill-stage optimizations to improving decode-stage residual KV cache efficiency.
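The core intuition behind mini-sequence partitioning can be pictured with a small sketch. This is an illustrative toy, not the paper's implementation: it chunks the sequence dimension of a memory-heavy last-layer (LM-head) computation so that only a `chunk x vocab` slab of logits exists at any moment, instead of the full `seq_len x vocab` tensor. The function and variable names are hypothetical.

```python
import numpy as np

def greedy_tokens_minisequence(hidden, w_head, chunk=4):
    """Toy sketch of mini-sequence last-layer processing.

    hidden: (seq_len, d_model) final hidden states
    w_head: (d_model, vocab) output-projection weights
    Processes the sequence `chunk` positions at a time, so peak
    intermediate memory is chunk * vocab instead of seq_len * vocab.
    """
    out = []
    for start in range(0, hidden.shape[0], chunk):
        # Only this mini-sequence's logits are materialized.
        logits = hidden[start:start + chunk] @ w_head  # (<=chunk, vocab)
        out.extend(int(i) for i in logits.argmax(axis=-1))
    return out

# Chunked computation matches the full (memory-hungry) computation.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((10, 16))
w_head = rng.standard_normal((16, 32))
full = [int(i) for i in (hidden @ w_head).argmax(axis=-1)]
assert greedy_tokens_minisequence(hidden, w_head) == full
```

Because each mini-sequence's result is exact, the output is bit-identical to the unpartitioned computation, which is consistent with the paper's claim of preserved output fidelity.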
Problem

Research questions and friction points this paper is trying to address.

High GPU memory demand limits long-context inference
Maximum context length is constrained without accuracy trade-offs
Prefill memory has been the dominant inference bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

Partitions layers into mini-sequences for efficiency
Integrates KV cache offloading seamlessly
Reduces peak memory usage by over 50% on average
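The offloading idea in the list above can be sketched as a hierarchical cache: per-layer KV tensors live in host (CPU) memory and only the active layer's cache is staged into a small device-side buffer. This is a hypothetical illustration of the general technique, not the paper's coordinated offloading mechanism; all class and method names are invented for the sketch.

```python
import numpy as np

class OffloadedKVCache:
    """Toy hierarchical KV cache: host storage per layer, one staged layer.

    Device-resident KV is bounded by a single layer's cache, at the
    cost of a host-to-device transfer (simulated here by np.stack)
    each time a different layer is fetched.
    """
    def __init__(self, n_layers):
        self.host = {i: ([], []) for i in range(n_layers)}  # "CPU" storage
        self.staged = None  # (layer_id, keys, values) on the "device"

    def append(self, layer, k, v):
        # New KV entries go straight to host storage.
        ks, vs = self.host[layer]
        ks.append(k)
        vs.append(v)

    def fetch(self, layer):
        # Stage this layer's cache into the device buffer on demand.
        ks, vs = self.host[layer]
        self.staged = (layer, np.stack(ks), np.stack(vs))
        return self.staged[1], self.staged[2]

# Usage: two decode steps for layer 0, then attention fetches its cache.
cache = OffloadedKVCache(n_layers=2)
for _ in range(2):
    cache.append(0, np.ones(4), np.zeros(4))
keys, values = cache.fetch(0)
assert keys.shape == (2, 4) and values.shape == (2, 4)
```

In a real system the staging step would be an asynchronous CPU-to-GPU copy overlapped with compute; the sketch only shows the bookkeeping that keeps device-resident KV bounded.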