Compute Where it Counts: Self Optimizing Language Models

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the inefficiency of static computation budgets in conventional large language model inference, which do not adapt to the varying difficulty of individual decoding steps. The authors propose Self-Optimizing Language Models (SOL), which pair a frozen backbone with a lightweight policy network that reads the backbone's hidden state and selects a discrete efficiency action at each decoding step. A single action can jointly control token-level attention sparsity, structured activation pruning in the MLP, and activation quantization bit-width, leaving the base model weights unchanged. The policy is trained with group-relative policy optimization on teacher-forced episodes, comparing multiple counterfactual compute schedules for the same token sequence. At matched compute budgets, SOL improves quality over static allocation and strong random schedule search, improves MMLU accuracy by up to 7.3% over uniform budget allocation, and yields a better quality-efficiency Pareto front across the reported experiments.
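As a rough illustration of what a joint discrete efficiency action might look like, the sketch below enumerates a small action set over the three knobs the summary names. The specific levels, the `EfficiencyAction` class, the `relative_cost` proxy, and the greedy `select_action` helper are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class EfficiencyAction:
    """One joint setting of the three knobs (hypothetical levels)."""
    attn_keep: float  # fraction of attended tokens kept (attention sparsity)
    mlp_keep: float   # fraction of MLP activations kept (structured pruning)
    act_bits: int     # activation quantization bit-width

    def relative_cost(self) -> float:
        # Toy cost proxy (an assumption, not the paper's cost model):
        # average of the three knobs, scaled so full compute equals 1.0.
        return (self.attn_keep + self.mlp_keep + self.act_bits / 16) / 3


# Hypothetical discrete levels for each knob.
ATTN_KEEP = (0.25, 0.5, 1.0)
MLP_KEEP = (0.5, 1.0)
ACT_BITS = (4, 8, 16)

# The joint action set is the Cartesian product of the per-knob levels.
ACTIONS = [EfficiencyAction(a, m, b)
           for a, m, b in product(ATTN_KEEP, MLP_KEEP, ACT_BITS)]


def select_action(logits):
    """Greedily pick the action with the highest policy logit.

    `logits` stands in for the output of a policy head applied to the
    frozen model's hidden state at the current decoding step.
    """
    best = max(range(len(logits)), key=logits.__getitem__)
    return ACTIONS[best]
```

Treating the three knobs as one joint action keeps the policy output a single categorical distribution, at the price of an action set that grows multiplicatively with the number of levels per knob.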

📝 Abstract

Efficient LLM inference research has largely focused on reducing the cost of each decoding step (e.g., using quantization, pruning, or sparse attention), typically applying a uniform computation budget to every generated token. In practice, token difficulty varies widely, so static compression can over-compute on easy steps and under-compute on hard ones. We study dynamic budget allocation for autoregressive decoding: learning how much computation to spend per token from within a single model. Self-Optimizing Language Models (SOL) pair a frozen LLM with a lightweight policy network that reads the LLM hidden state and selects a discrete efficiency action at each decode step. Actions can jointly control (i) token-level attention sparsity, (ii) structured activation pruning in the MLP, and (iii) activation quantization bit-width, while leaving the base model weights unchanged. We train the policy with group-relative policy optimization on teacher-forced episodes: the token sequence is fixed, while we sample multiple compute schedules (i.e., "counterfactual" schedules that vary only the efficiency actions for the same token path) and compare their likelihoods under the same supervision. Our reward trades off language-model quality against soft penalties that encourage episode-average budget usage to match a requested target. Across model variants and compute regimes, SOL improves quality at matched budget over static allocation and strong random schedule search, offering a complementary axis for inference-efficiency optimization. SOL discovers a better quality-efficiency Pareto front across all our experiments and improves MMLU accuracy by up to 7.3% over uniform budget allocation strategies.
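The training signal described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the quality term is taken to be the mean token log-likelihood, the soft budget penalty is assumed to be an absolute deviation from the target, and the advantage uses the standard group-relative normalization (reward minus group mean, divided by group standard deviation). The function names and the `penalty_weight` parameter are hypothetical.

```python
from statistics import mean, pstdev


def schedule_reward(token_logprobs, step_costs, target_budget,
                    penalty_weight=1.0):
    """Reward for one compute schedule on a fixed, teacher-forced sequence.

    token_logprobs: log-likelihood of each reference token under this schedule.
    step_costs:     relative compute spent at each decoding step.
    target_budget:  requested episode-average compute.
    """
    quality = mean(token_logprobs)
    # Assumed penalty form: absolute gap between the episode-average
    # cost and the requested target.
    budget_gap = abs(mean(step_costs) - target_budget)
    return quality - penalty_weight * budget_gap


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group of counterfactual schedules."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three counterfactual schedules for the same token sequence (toy numbers).
schedules = [
    ([-1.2, -0.4, -0.9], [0.5, 0.5, 0.5]),  # uniform budget
    ([-1.0, -0.5, -0.6], [0.8, 0.2, 0.5]),  # more compute on some steps
    ([-0.8, -0.3, -0.5], [1.0, 1.0, 1.0]),  # full compute, over budget
]
rewards = [schedule_reward(lp, cost, target_budget=0.5)
           for lp, cost in schedules]
advantages = group_relative_advantages(rewards)
```

Because every schedule in the group is scored on the same token path, differences in reward come only from the efficiency actions, which is what makes the within-group comparison meaningful.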

Problem

Research questions and friction points this paper is trying to address.

efficient inference

dynamic computation

token-level budgeting

language models

compute allocation

Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamic budget allocation

self-optimizing language models

adaptive inference