🤖 AI Summary
This work challenges the prevailing assumption that lower-precision quantization universally improves efficiency in neural inference by revealing a “quantization trap” in multi-hop reasoning tasks. Through theoretical modeling, hardware energy profiling, and empirical experiments, the study demonstrates that reducing precision from 16-bit to 8- or 4-bit not only degrades accuracy but also increases end-to-end energy consumption due to overhead from hardware format conversions and dequantization kernel latency. These findings contradict conventional neural scaling laws and the industry’s “smaller-is-better” paradigm, showing that linear scaling assumptions fail in complex reasoning scenarios where computational and memory-access patterns interact nontrivially with quantization-induced inefficiencies.
📝 Abstract
Neural scaling laws provide a predictable recipe for AI advancement: reducing numerical precision should linearly improve computational efficiency and energy profile (E ∝ bits). In this paper, we demonstrate that this scaling law breaks down in the context of multi-hop reasoning. We reveal a 'quantization trap' where reducing precision from 16-bit to 8- or 4-bit paradoxically increases net energy consumption while degrading reasoning accuracy. We provide a rigorous theoretical decomposition that attributes this failure to hardware casting overhead, to the hidden latency cost of dequantization kernels, which becomes a dominant bottleneck in sequential reasoning chains, and to a sequential energy amortization failure. As a result, this breakdown of the scaling law is unavoidable in practice. Our findings suggest that the industry's "smaller-is-better" heuristic is mathematically counterproductive for complex reasoning tasks.
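To make the failure mode concrete, here is a rough back-of-the-envelope sketch, not the paper's own decomposition or notation (the symbols H, α, C_cast, and C_dequant are illustrative assumptions): suppose each of H sequential reasoning hops pays a compute energy proportional to the bit width b, plus fixed casting and dequantization overhead that is absent on the native 16-bit path.

```latex
% Illustrative per-hop energy decomposition (assumed notation, not from the paper):
%   H          - number of sequential reasoning hops
%   alpha      - compute energy per bit of precision, per hop
%   C_cast     - fixed per-hop hardware format-conversion (casting) overhead
%   C_dequant  - fixed per-hop dequantization-kernel overhead
E_{\mathrm{total}}(b)
  \;=\; \sum_{h=1}^{H}\bigl(\alpha b + C_{\mathrm{cast}} + C_{\mathrm{dequant}}\bigr)
  \;=\; H\bigl(\alpha b + C_{\mathrm{cast}} + C_{\mathrm{dequant}}\bigr),
\qquad
E_{\mathrm{total}}(16) \;=\; 16\,\alpha H .
```

Under this sketch, the 8- or 4-bit path consumes more energy than the 16-bit baseline whenever C_cast + C_dequant > α(16 − b); because the overhead recurs on every hop, it never amortizes over the reasoning chain, which is one way to read the "sequential energy amortization failure" the abstract refers to.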