Is (Selective) Round-To-Nearest Quantization All You Need?

📅 2025-05-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work challenges the prevailing consensus that round-to-nearest (RTN) quantization is obsolete, conducting a systematic evaluation of its practicality in large language model (LLM) deployment. We propose selective RTN: a layer-wise precision allocation strategy—e.g., FP16/INT4 hybrid—that dynamically assigns bit-widths based on layer importance, achieving Pareto-optimal trade-offs between accuracy and computational overhead. Integrated with the Marlin kernel, it enables lossless weight loading and low-latency inference. Experiments across mainstream LLMs show that our method matches state-of-the-art quantization schemes in accuracy while improving token throughput by up to 1.8×, significantly reducing GPU memory footprint and end-to-end latency. Our core contribution is the reestablishment of RTN as a lightweight, efficient, and practically viable quantization baseline, complemented by a hierarchical precision scheduling framework that jointly optimizes inference performance and deployment efficiency.

📝 Abstract
Quantization has become a necessary tool for serving ever-increasing Large Language Models (LLMs). RTN (Round-to-Nearest) is perhaps the simplest quantization technique, and it has been around well before LLMs surged to the forefront of machine learning (ML) research. Yet, it has been largely dismissed by recent and more advanced quantization methods that claim superiority over RTN in nearly every aspect of performance. This work aims to dispel this established point of view, showing that RTN is not only much cheaper to apply, but also that its token generation throughput can exceed, and its accuracy can match, that of more advanced alternatives. In particular, we discuss our implementation of RTN based on the recent Marlin kernels and demonstrate how the accuracy of RTN can be gradually improved by selectively increasing the data precision format of certain model layers and modules. Based on our results, we argue that RTN presents a viable and practical choice for quantizing LLMs.
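To make the RTN idea concrete, here is a minimal sketch of symmetric, group-wise round-to-nearest INT4 weight quantization in NumPy. This is an illustrative reconstruction of the general RTN scheme, not the paper's Marlin-based implementation; the group size and symmetric-scale choices are assumptions.

```python
import numpy as np

def rtn_quantize(w, bits=4, group_size=128):
    """Round-to-nearest symmetric quantization with per-group scales.

    Each group of `group_size` weights shares one FP scale; values are
    simply rounded to the nearest representable integer (no calibration).
    """
    w = w.reshape(-1, group_size)
    qmax = 2 ** (bits - 1) - 1                       # 7 for INT4
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def rtn_dequantize(q, scale):
    """Recover approximate FP weights from integers and group scales."""
    return q.astype(np.float32) * scale

# Round-trip a random weight vector and measure the worst-case error.
rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
q, s = rtn_quantize(w, bits=4, group_size=128)
w_hat = rtn_dequantize(q, s).reshape(-1)
err = np.abs(w - w_hat).max()
```

Because RTN needs no calibration data or iterative optimization, the entire quantization step is a single pass over the weights, which is the source of the cost advantage the abstract describes.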
Problem

Research questions and friction points this paper is trying to address.

Evaluating Round-To-Nearest (RTN) quantization for Large Language Models (LLMs)
Comparing RTN's cost and performance with advanced quantization methods
Improving RTN accuracy via selective precision enhancement in model layers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Round-To-Nearest (RTN) quantization for LLMs
Implements RTN via Marlin kernels for efficiency
Selectively increases precision for better accuracy
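The selective-precision idea above can be sketched as a simple budget-based allocator: rank layers by some sensitivity proxy and keep the most sensitive ones in FP16 while quantizing the rest to INT4 with RTN. The error-proxy scores, budget parameter, and layer names here are hypothetical; the paper's actual selection criterion may differ.

```python
def assign_precision(layer_errors, fp16_budget=2):
    """Assign per-layer precision from a sensitivity proxy.

    layer_errors: dict mapping layer name -> quantization error proxy
                  (e.g. per-layer output error after RTN, an assumption).
    fp16_budget:  number of layers allowed to stay in FP16.
    Returns a dict mapping layer name -> "fp16" or "int4".
    """
    ranked = sorted(layer_errors, key=layer_errors.get, reverse=True)
    keep_fp16 = set(ranked[:fp16_budget])
    return {name: ("fp16" if name in keep_fp16 else "int4")
            for name in layer_errors}

# Hypothetical sensitivity scores for a few layers.
errors = {"lm_head": 0.9, "layers.0.mlp": 0.2,
          "layers.1.attn": 0.6, "layers.2.mlp": 0.1}
plan = assign_precision(errors, fp16_budget=2)
```

With this kind of plan, only the layers that contribute most to quantization error pay the FP16 memory cost, which is how a hybrid FP16/INT4 scheme can approach full-precision accuracy while keeping most of RTN's footprint savings.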