🤖 AI Summary
This study addresses the high energy consumption of deploying large language models (LLMs) in industrial settings, where existing energy-saving techniques lack empirical validation under real-world conditions. In a production-grade chatbot environment, we present the first systematic evaluation of four green LLM approaches: small-large model collaboration via Nvidia's Prompt Task and Complexity Classifier (NPCC), prompt optimization, quantization, and batching. We examine their trade-offs among energy efficiency, accuracy, and response latency. Our findings reveal that only NPCC-based collaboration achieves substantial energy reduction without compromising model performance; the other methods, while in some cases cutting energy use by up to 90%, incur accuracy degradation too severe for practical use. This work provides empirical evidence and practical guidance for improving the energy efficiency of industrial-scale LLM deployments.
📝 Abstract
The rapid adoption of large language models (LLMs) has raised concerns about their substantial energy consumption, especially when deployed at industry scale. While several techniques have been proposed to address this, limited empirical evidence exists on their effectiveness in LLM-based industry applications. To fill this gap, we analyzed a chatbot application in an industrial context at Schuberg Philis, a Dutch IT services company. We selected four techniques, namely Small and Large Model Collaboration, Prompt Optimization, Quantization, and Batching, applied them to the application in eight variations, and conducted experiments to study their impact on energy consumption, accuracy, and response time compared to the unoptimized baseline. Our results show that several techniques, such as Prompt Optimization and 2-bit Quantization, reduced energy use significantly, sometimes by up to 90%. However, these techniques in particular degraded accuracy, to a degree that is not acceptable in practice. The only technique that achieved substantial energy reductions without seriously harming the other qualities was Small and Large Model Collaboration via Nvidia's Prompt Task and Complexity Classifier (NPCC) with prompt complexity thresholds. This highlights that reducing the energy consumption of LLM-based applications is not difficult in practice; improving their energy efficiency, i.e., reducing energy use without harming other qualities, remains challenging. Our study provides practical insights to move towards this goal.
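The Small and Large Model Collaboration approach named above amounts to threshold-based routing: a classifier scores each prompt's complexity, and only prompts above the threshold go to the large model. The sketch below is illustrative only; the classifier, the placeholder models, and the threshold value are assumptions for the example, not the NPCC setup used in the study.

```python
# Hypothetical sketch of small-large model collaboration with a prompt
# complexity threshold. All names here are placeholders, not the paper's
# implementation.

def route_prompt(prompt, classify, small_model, large_model, threshold=0.5):
    """Send low-complexity prompts to the small (cheaper) model and
    high-complexity prompts to the large model."""
    score = classify(prompt)  # complexity score in [0, 1]
    return small_model(prompt) if score < threshold else large_model(prompt)

# Toy stand-ins: a "classifier" that scores complexity by prompt length,
# and two placeholder models that tag their answers.
classify = lambda p: min(len(p.split()) / 20, 1.0)
small = lambda p: f"[small model] answer to: {p}"
large = lambda p: f"[large model] answer to: {p}"

print(route_prompt("What time is it?", classify, small, large))
print(route_prompt(
    "Compare three database replication strategies and their trade-offs "
    "under network partitions, with examples.", classify, small, large))
```

Because most chatbot traffic tends to be simple, such routing can shift the bulk of requests to the cheaper model, which is the intuition behind the energy savings reported for NPCC.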