🤖 AI Summary
To address the high computational depth and energy cost of inference in pre-trained large language models (LLMs), this paper proposes a retraining-free, layer-pairing parallel inference method. By analyzing redundancy and decoupling among intermediate layers, the authors design a static layer-importance scoring mechanism and a dynamic layer-grouping strategy that reconstruct the forward-pass computation graph, enabling cross-layer, operator-level parallelism. The key contribution is the first practical translation of inter-layer decoupling into an inference acceleration mechanism, achieving a favorable trade-off between accuracy and efficiency. Experiments across mainstream LLMs demonstrate an average 1.20× improvement in generation throughput while preserving 95%–99% of the original accuracy. The method significantly reduces serving latency and GPU resource consumption without requiring model retraining or architectural modification.
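As a rough illustration of what such a scoring-and-grouping pipeline could look like, the sketch below scores each transformer block by how much it perturbs its input and greedily pairs adjacent low-impact blocks. The cosine-similarity metric, the threshold, and the greedy pairing rule are illustrative assumptions, not the paper's exact mechanism.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def layer_importance(blocks, hidden):
    """Score each block by how much it changes its input (1 - cosine similarity).

    The cosine-similarity heuristic is an illustrative assumption,
    not necessarily the paper's exact scoring mechanism.
    """
    scores = []
    for block in blocks:
        out = block(hidden)
        cos = F.cosine_similarity(hidden.flatten(1), out.flatten(1), dim=-1)
        scores.append(1.0 - cos.mean().item())  # small change => low importance
        hidden = out
    return scores

def pair_adjacent_low_importance(scores, threshold=0.1):
    """Greedily group adjacent low-importance blocks into candidate parallel pairs."""
    pairs, i = [], 0
    while i < len(scores) - 1:
        if scores[i] < threshold and scores[i + 1] < threshold:
            pairs.append((i, i + 1))
            i += 2
        else:
            i += 1
    return pairs

# Toy demo: identity-like blocks score near 0 and get paired; the Linear does not.
blocks = [torch.nn.Identity(), torch.nn.Identity(),
          torch.nn.Linear(16, 16),
          torch.nn.Identity(), torch.nn.Identity()]
scores = layer_importance(blocks, torch.randn(2, 16))
print(pair_adjacent_low_importance(scores))  # [(0, 1), (3, 4)]
```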
📝 Abstract
Large Language Models demonstrate remarkable capabilities at the cost of high compute requirements. While recent research has shown that intermediate layers can be removed, or their order shuffled, without significantly impacting performance, these findings have not been exploited to reduce the computational cost of inference. We investigate several ways to reduce the depth of pre-trained LLMs without significantly affecting performance. Leveraging our insights, we present a novel approach that exploits this decoupling between layers by grouping some of them into pairs that can be evaluated in parallel. This modification of the computational graph, by exposing more parallelism, yields an average improvement of around 1.20× in the number of tokens generated per second, without retraining or fine-tuning, while retaining 95%–99% of the original accuracy. Empirical evaluation demonstrates that this approach significantly improves serving efficiency while maintaining model performance, offering a practical improvement for large-scale LLM deployment.
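To make the pairing idea concrete, here is a minimal PyTorch sketch of evaluating two residual branches on the same input and summing their updates, overlapping them on separate CUDA streams when a GPU is available. It assumes each paired block returns only its residual update (without the skip connection); the paper's actual computation-graph rewrite may differ.

```python
import torch

class ParallelPair(torch.nn.Module):
    """Run two residual branches on the same input and sum their updates.

    f and g are assumed to return ONLY their residual update, an
    assumption made for this sketch, not the paper's verified construction.
    """

    def __init__(self, f: torch.nn.Module, g: torch.nn.Module):
        super().__init__()
        self.f = f
        self.g = g
        self.stream = torch.cuda.Stream() if torch.cuda.is_available() else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sequential layers compute x -> x + f(x) -> (x + f(x)) + g(x + f(x)).
        # If g is insensitive to f's update (weak coupling), then
        # g(x + f(x)) ~= g(x), so the pair collapses to x + f(x) + g(x).
        if self.stream is not None:
            # Launch g on a side CUDA stream while f runs on the main stream.
            self.stream.wait_stream(torch.cuda.current_stream())
            with torch.cuda.stream(self.stream):
                g_out = self.g(x)
            f_out = self.f(x)  # runs concurrently on the main stream
            torch.cuda.current_stream().wait_stream(self.stream)
        else:
            f_out, g_out = self.f(x), self.g(x)  # CPU fallback: same math
        return x + f_out + g_out

# Toy usage with two small residual branches standing in for transformer blocks.
f = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.GELU())
g = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.GELU())
pair = ParallelPair(f, g)
print(pair(torch.randn(2, 16)).shape)  # torch.Size([2, 16])
```

Summing g(x) with f(x) instead of composing them is exact only when g is insensitive to f's update, which is why pairing is restricted to layers identified as weakly coupled.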