Problem
Research questions and friction points this paper is trying to address.
How to reduce KV cache memory in LLMs for more efficient inference
How to identify unimportant K cache channels that can be pruned (the paper's answer: learnable masks)
How to accelerate decoding without sacrificing accuracy
Innovation
Methods, ideas, or system contributions that make the work stand out.
Learning-based K cache channel pruning with trainable per-channel masks (see the mask sketch after this list)
Two-stage training that produces static (fixed) pruning masks
Custom kernel that speeds up the attention computation (see the decode-step sketch below)
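
A minimal sketch of how learnable per-channel masks and the two-stage schedule might fit together. Everything here (the class name, the sigmoid parameterization, the L1-style sparsity penalty, and the top-k binarization in `to_static_mask`) is my own assumption for illustration, not the paper's actual API or training recipe.

```python
import torch
import torch.nn as nn

class LearnableKChannelMask(nn.Module):
    """Soft, learnable importance score per K-cache channel (assumed design)."""

    def __init__(self, head_dim: int):
        super().__init__()
        # One logit per K channel; sigmoid turns it into a soft keep-probability.
        self.logits = nn.Parameter(torch.zeros(head_dim))

    def forward(self, k: torch.Tensor) -> torch.Tensor:
        # k: (batch, heads, seq_len, head_dim); scale each channel by its soft mask.
        return k * torch.sigmoid(self.logits)

    def sparsity_loss(self) -> torch.Tensor:
        # L1-style penalty pushing soft masks toward zero (assumed regularizer).
        return torch.sigmoid(self.logits).mean()

    def to_static_mask(self, keep_ratio: float) -> torch.Tensor:
        # Assumed stage two: freeze the learned scores into a static boolean mask
        # by keeping the top-k channels; pruned channels are dropped from the cache.
        k = max(1, int(keep_ratio * self.logits.numel()))
        keep = torch.zeros_like(self.logits, dtype=torch.bool)
        keep[torch.topk(self.logits, k).indices] = True
        return keep
```

Under this reading, stage one trains the soft masks under the sparsity penalty, and stage two freezes them via `to_static_mask` so that only the kept channels need to be stored in the K cache at inference time.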
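
And a plain PyTorch sketch of what a decode step could look like once K channels are pruned; the tensor layout, the `keep_idx` argument, and the scaling choice are assumptions on my part. A fused custom kernel would presumably perform the channel gather and the dot product in one pass instead of materializing intermediates, which is where the speedup would come from.

```python
import torch

def pruned_decode_attention(q, k_cache_pruned, v_cache, keep_idx):
    """One decode step against a channel-pruned K cache (illustrative sketch).

    q:              (batch, heads, 1, head_dim)        full-width query for the new token
    k_cache_pruned: (batch, heads, seq_len, kept_dim)  K cache holding only the kept channels
    v_cache:        (batch, heads, seq_len, head_dim)  V cache left at full width
    keep_idx:       (kept_dim,) long tensor of kept K-channel indices
    """
    # Index the query down to the same channel subset stored in the pruned K cache.
    q_pruned = q[..., keep_idx]
    # Scaling by the original head_dim is one plausible choice; the paper may differ.
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q_pruned, k_cache_pruned.transpose(-2, -1)) * scale  # (b, h, 1, seq_len)
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v_cache)  # (b, h, 1, head_dim)
```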