🤖 AI Summary
Traditional Transformers incur high computational and memory costs during large-scale training and deployment. Method: This survey constructs a unified taxonomy of efficient large language model (LLM) architectures, covering linearized sequence modeling, sparse attention mechanisms, Mixture-of-Experts (MoE) architectures, diffusion-based language modeling, and the transfer of these techniques to other modalities. It situates sparsification, linearization, and hybrid modeling within a single framework that balances theoretical grounding with engineering practicality. Contribution/Results: The survey presents a comprehensive architectural landscape of efficient LLMs, spanning training, inference, and multimodal extension, and offers a systematic blueprint for scalable foundation-model design under resource constraints. The surveyed techniques substantially reduce computational and memory overhead, enabling practical deployment of high-performance, low-resource AI systems.
📝 Abstract
Large Language Models (LLMs) have delivered impressive results in language understanding, generation, and reasoning, and have pushed the capability boundaries of multimodal models. Transformer models, the foundation of modern LLMs, offer a strong baseline with excellent scaling properties. However, the traditional transformer architecture requires substantial computation, posing significant obstacles for large-scale training and practical deployment. In this survey, we offer a systematic examination of innovative LLM architectures that address the inherent limitations of transformers and improve efficiency. Starting from language modeling, the survey covers the background and technical details of linear and sparse sequence modeling methods, efficient full-attention variants, sparse mixture-of-experts, hybrid model architectures incorporating the above techniques, and emerging diffusion LLMs. We further discuss applications of these techniques to other modalities and consider their wider implications for developing scalable, resource-aware foundation models. By grouping recent studies into these categories, the survey presents a blueprint of modern efficient LLM architectures, and we hope it will motivate future research toward more efficient, versatile AI systems.
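As an illustrative aside (not taken from the survey itself), the core efficiency contrast can be sketched in a few lines: standard attention materializes an n×n score matrix, while kernelized linear attention regroups phi(Q)·(phi(K)ᵀV) so that only a d×d summary is formed. The feature map `phi` below is a hypothetical stand-in chosen for illustration; published linear-attention methods use maps such as elu(x)+1.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard full attention: materializes an (n, n) score matrix, O(n^2 * d)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])               # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                    # (n, d)

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Kernelized linear attention sketch: associativity lets us compute the
    (d, d) summary phi(K)^T V once, giving O(n * d^2) instead of O(n^2 * d).
    phi is an illustrative nonnegative feature map, not a specific method."""
    Qp, Kp = phi(Q), phi(K)                               # (n, d) each
    kv = Kp.T @ V                                         # (d, d) summary
    z = Kp.sum(axis=0)                                    # (d,) normalizer
    return (Qp @ kv) / (Qp @ z)[:, None]                  # (n, d)

# Toy usage: both produce an (n, d) output, but only the first builds an n x n matrix.
n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```

The key design point is associativity: rewriting (phi(Q) phi(K)ᵀ) V as phi(Q) (phi(K)ᵀ V) replaces the sequence-length-squared cost with one linear in sequence length, which is the property the linearized methods surveyed here exploit.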