Informed Routing in LLMs: Smarter Token-Level Computation for Faster Inference

📅 2025-10-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) suffer from high inference costs, and existing dynamic token-level computation allocation methods rely on greedy routing, which causes irreversible information loss and suboptimal decisions. Method: This paper proposes a recoverability-aware token-level computation allocation mechanism. It introduces the Lightweight Feature Forecaster (LFF) to estimate a unit's output during the forward pass and, per token, dynamically chooses between exact and efficient approximate computation based on a recoverability criterion, thereby overcoming the limitations of greedy routing. Contribution/Results: This work is the first to incorporate recoverability modeling into token-level computation scheduling, establishing a novel "execute-or-approximate" paradigm. It achieves state-of-the-art accuracy-efficiency trade-offs across multiple sparsity levels and matches or surpasses strong fully fine-tuned baselines, even without final LoRA fine-tuning, while reducing training time by over 50%.

📝 Abstract
The deployment of large language models (LLMs) in real-world applications is increasingly limited by their high inference cost. While recent advances in dynamic token-level computation allocation attempt to improve efficiency by selectively activating model components per token, existing methods rely on greedy routing, a myopic execute-or-skip mechanism that often leads to irreversible information loss and suboptimal token selection. This paper introduces informed routing, a new paradigm that proactively addresses these issues. The key insight is to assess not only a token's immediate importance but also its recoverability, i.e., how well its transformation can be approximated. To this end, we propose the Lightweight Feature Forecaster (LFF), a small predictive module that estimates a unit's output before routing decisions are made. This enables a flexible execute-or-approximate policy that preserves model fidelity while drastically reducing computation. Extensive experiments on both language modeling and reasoning tasks show that informed routing achieves state-of-the-art efficiency-performance trade-offs across multiple sparsity levels. Notably, even without final LoRA fine-tuning, our method matches or surpasses strong baselines that require full fine-tuning, all while reducing training time by over 50%. The code is available at: https://github.com/EIT-NLP/informed-routing
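As a rough illustration of the execute-or-approximate policy described in the abstract, the sketch below routes each token through either an exact FFN unit or a cheap low-rank forecast. The forecaster architecture, the linear recoverability scorer, and the threshold `tau` are all assumptions for illustration only, not the paper's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 16, 64

# Exact unit: a stand-in FFN sublayer (the expensive path).
W1 = rng.normal(scale=0.1, size=(d_model, d_hidden))
W2 = rng.normal(scale=0.1, size=(d_hidden, d_model))

def exact_ffn(x):
    return np.maximum(x @ W1, 0.0) @ W2

# Hypothetical Lightweight Feature Forecaster: a low-rank map that cheaply
# predicts the unit's output before any routing decision is made.
rank = 4
A = rng.normal(scale=0.1, size=(d_model, rank))
B = rng.normal(scale=0.1, size=(rank, d_model))

def lff(x):
    return x @ A @ B

# Hypothetical recoverability scorer: a tiny linear head estimating how poorly
# the forecast would approximate this token (trained offline in practice).
w_err = rng.normal(scale=0.1, size=d_model)

def predicted_error(x):
    return abs(x @ w_err)

def informed_route(tokens, tau=0.05):
    """Execute-or-approximate: recoverable tokens take the cheap forecast,
    the rest receive the exact computation."""
    outputs, decisions = [], []
    for x in tokens:
        if predicted_error(x) <= tau:      # recoverable -> approximate
            outputs.append(x + lff(x))
            decisions.append("approximate")
        else:                              # not recoverable -> execute
            outputs.append(x + exact_ffn(x))
            decisions.append("execute")
    return np.stack(outputs), decisions

tokens = rng.normal(size=(8, d_model))
outs, decs = informed_route(tokens)
```

Unlike greedy execute-or-skip routing, every token here still receives some transformation, which is what allows aggressive sparsity without irreversible information loss.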
Problem

Research questions and friction points this paper is trying to address.

Reducing LLM inference costs through dynamic token-level computation allocation
Overcoming greedy routing's irreversible information loss and suboptimal selection
Enabling flexible execute-or-approximate decisions while preserving model fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight Feature Forecaster predicts unit outputs
Execute-or-approximate policy preserves model fidelity
Assesses token importance and recoverability for routing
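The contrast with greedy routing can be made concrete with a toy example: skipping a token discards its transformation entirely, while substituting a forecast retains most of it. The `exact` and `forecast` functions below are arbitrary stand-ins for illustration, not the paper's components.

```python
import numpy as np

x = np.array([0.5, -1.0, 0.25])      # one token's hidden state
exact = lambda v: np.tanh(v)         # stand-in for the unit's true update
forecast = lambda v: 0.9 * v         # stand-in lightweight forecast

target = x + exact(x)                # what full computation would produce

greedy_out = x                       # execute-or-skip: skipping drops the update
informed_out = x + forecast(x)       # execute-or-approximate: keep a cheap one

err_greedy = np.linalg.norm(target - greedy_out)      # ~0.92
err_informed = np.linalg.norm(target - informed_out)  # ~0.14
```

The approximation recovers most of the skipped update, which is why recoverable tokens can be routed to the cheap path with little loss of fidelity.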
Authors

Chao Han
PhD, Electrical Engineering, California Institute of Technology
Optical imaging systems, MEMS, biomedical microdevices

Yijuan Liang
Institute of Digital Twin, Eastern Institute of Technology, Ningbo

Zihao Xuan
The Hong Kong University of Science and Technology

Daokuan Wu
Institute of Digital Twin, Eastern Institute of Technology, Ningbo

Wei Zhang
Institute of Digital Twin, Eastern Institute of Technology, Ningbo

Xiaoyu Shen
Eastern Institute of Technology, Ningbo
Language model, multi-modal learning, reasoning