🤖 AI Summary
While large language models have demonstrated remarkable engineering success, they remain theoretically underdeveloped and mechanistically opaque—essentially operating as “black boxes.” This work proposes a unified theoretical framework encompassing the entire lifecycle of large language models, systematically analyzing the core mechanisms across six stages: data preparation, model preparation, training, alignment, inference, and evaluation. By integrating information theory, optimization theory, and representation learning, the framework elucidates the mathematical principles underlying critical issues such as data mixing strategies, architectural expressivity, and alignment optimization. Furthermore, it identifies forward-looking challenges, including self-improving synthetic data generation, safety boundaries, and the origins of emergent intelligence. This study provides a structured roadmap for transforming large language models from empirical engineering artifacts into an explainable, predictable, and verifiable scientific discipline.
📝 Abstract
The rapid emergence of Large Language Models (LLMs) has precipitated a profound paradigm shift in Artificial Intelligence, delivering monumental engineering successes that increasingly impact modern society. However, a critical paradox persists within the current field: despite their empirical efficacy, our theoretical understanding of LLMs remains disproportionately nascent, forcing these systems to be treated largely as “black boxes.” To address this theoretical fragmentation, this survey proposes a unified lifecycle-based taxonomy that organizes the research landscape into six distinct stages: Data Preparation, Model Preparation, Training, Alignment, Inference, and Evaluation. Within this framework, we provide a systematic review of the foundational theories and internal mechanisms driving LLM performance. Specifically, we analyze core theoretical issues such as the mathematical justification for data mixtures, the representational limits of various architectures, and the optimization dynamics of alignment algorithms. Moving beyond current best practices, we identify critical frontier challenges, including the theoretical limits of synthetic data self-improvement, the mathematical bounds of safety guarantees, and the mechanistic origins of emergent intelligence. By connecting empirical observations with rigorous scientific inquiry, this work provides a structured roadmap for transitioning LLM development from engineering heuristics toward a principled scientific discipline.