🤖 AI Summary
Large language models (LLMs) face a fundamental trade-off between inference accuracy and energy consumption.
Method: This paper proposes test-time compute (TTC) as an energy-efficient alternative to conventional model scaling, integrating empirical energy modeling, complexity-aware dynamic computation scheduling, and multi-task benchmark evaluation.
Contribution/Results: We provide the first systematic empirical demonstration that TTC substantially outperforms model scaling on complex reasoning tasks: at equal accuracy, it reduces energy per inference by 37%. TTC's efficacy scales strongly with output length, and it enables query-aware, adaptive resource allocation based on input complexity. Crucially, it requires no additional pretraining and can be deployed off-the-shelf to improve the accuracy-per-joule ratio at inference time. This work establishes a practical, deployable pathway toward green AI inference.
📝 Abstract
Scaling large language models (LLMs) has driven significant advances, yet it faces diminishing returns and escalating energy demands. This work introduces test-time compute (TTC), the allocation of additional computational resources during inference, as a compelling complement to conventional scaling strategies. Specifically, we investigate whether TTC can achieve superior accuracy-energy trade-offs compared to simply increasing model size. Our empirical analysis reveals that TTC surpasses traditional model scaling in accuracy per unit of energy, with notable gains on tasks demanding complex reasoning rather than mere factual recall. Further, we identify a critical interaction between TTC performance and output sequence length, demonstrating that strategically adjusting compute resources at inference time according to query complexity can substantially enhance efficiency. Our findings advocate for TTC as a promising direction, enabling more sustainable, accurate, and adaptable deployment of future language models without incurring additional pretraining costs.
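The query-aware allocation idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's method: the complexity heuristic, keyword list, and sampling budgets below are all assumptions chosen for the example; a real system would use a learned difficulty estimator and measured energy profiles.

```python
# Illustrative sketch of query-aware test-time compute (TTC) allocation:
# harder-looking queries receive a larger best-of-N sampling budget, while
# simple factual lookups run with minimal extra compute.
# All heuristics and thresholds here are hypothetical, not from the paper.

def estimate_complexity(query: str) -> float:
    """Crude complexity proxy in [0, 1] based on length and reasoning cues."""
    reasoning_markers = ("why", "prove", "derive", "step", "explain")
    score = min(len(query) / 200.0, 1.0)  # longer prompts treated as harder
    if any(marker in query.lower() for marker in reasoning_markers):
        score = min(score + 0.5, 1.0)  # reasoning-style wording bumps the score
    return score

def allocate_samples(query: str, min_n: int = 1, max_n: int = 8) -> int:
    """Map estimated complexity to a best-of-N sampling budget."""
    c = estimate_complexity(query)
    return min_n + round(c * (max_n - min_n))

easy = "What is the capital of France?"
hard = "Prove that the sum of the first n odd numbers equals n squared, step by step."
print(allocate_samples(easy), allocate_samples(hard))
```

Under this sketch, the factual query receives a small budget while the reasoning query receives most of the available samples, which is the adaptive behavior the abstract argues improves the accuracy-energy trade-off.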