🤖 AI Summary
Traditional quantization methods lack the flexibility to switch bit-widths between fine-tuning and deployment, which limits accuracy in on-device multi-task scenarios (e.g., understanding vs. generation tasks). This paper proposes OTARo, a robust quantization framework that enables multi-precision inference after a single fine-tuning pass. Its core contributions are: (1) Shared Exponent Floating-Point (SEFP) quantization, which unifies scale representations across bit-widths so that lower precisions are obtained by simple mantissa truncation; (2) Bit-width Path Search (BPS), an exploration-exploitation strategy that identifies effective precision configurations during fine-tuning; and (3) Low-Precision Asynchronous Accumulation (LAA), which mitigates gradient error under low bit-widths via asynchronous accumulation and delayed updates. Evaluated on LLaMA3.2-1B and LLaMA3-8B, the framework achieves consistently strong and robust performance across all target bit-widths (2–8 bits), improving flexibility and efficiency for edge deployment without task-specific retraining or separate quantization pipelines.
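The key property of a shared-exponent format is that all bit-widths share one scale, so a lower precision is just a truncated view of a higher one. The sketch below is a toy illustration of that idea (the paper's actual SEFP format and its bit accounting are more involved; `sefp_quantize` and its block-wise shared exponent are assumptions for illustration only):

```python
import numpy as np

def sefp_quantize(block, mantissa_bits):
    """Toy shared-exponent quantization: one exponent per block,
    per-value mantissas truncated to roughly `mantissa_bits` of resolution.
    Illustrative only -- not the paper's exact SEFP format."""
    max_abs = np.max(np.abs(block))
    if max_abs == 0:
        return np.zeros_like(block)
    # Shared exponent: a single power-of-two scale for the whole block,
    # so every bit-width uses the same scale factor.
    shared_exp = np.floor(np.log2(max_abs))
    scale = 2.0 ** shared_exp
    mantissas = block / scale              # magnitudes fall in [0, 2)
    # Truncate (toward zero) to the target fractional resolution.
    step = 2.0 ** -(mantissa_bits - 1)
    return np.trunc(mantissas / step) * step * scale

x = np.array([0.5, -1.25, 0.3, 0.9])
q8 = sefp_quantize(x, 8)   # high-precision view
q2 = sefp_quantize(x, 2)   # low-precision view
```

Because truncation is nested, re-truncating the 8-bit result to 2 bits reproduces the direct 2-bit quantization, which is what lets a single stored model serve every precision.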
📝 Abstract
Fine-tuning techniques for Large Language Models (LLMs) not only improve adaptability to diverse downstream tasks but also mitigate the adverse effects of model quantization. Despite this, conventional quantization suffers from a structural limitation that hinders flexibility during the fine-tuning and deployment stages. Practical on-device tasks demand different quantization precisions (i.e., different bit-widths); for example, understanding tasks tend to tolerate reduced precision better than generation tasks. Conventional quantization, which typically relies on scaling factors that are incompatible across bit-widths, fails to support on-device precision switching when confronted with complex real-world scenarios. To overcome this dilemma, we propose OTARo, a novel method that enables on-device LLMs to flexibly switch quantization precisions while maintaining robust performance after a single fine-tuning pass. OTARo introduces Shared Exponent Floating Point (SEFP), a distinct quantization mechanism that produces different bit-widths through simple mantissa truncations of a single model. Moreover, to achieve bit-width robustness in downstream applications, OTARo learns against the losses induced by different bit-widths. The method involves two key strategies: (1) Exploitation-Exploration Bit-Width Path Search (BPS), which iteratively updates the search path via a designed scoring mechanism; and (2) Low-Precision Asynchronous Accumulation (LAA), which performs asynchronous gradient accumulation and delayed updates under low bit-widths. Experiments on popular LLMs, e.g., LLaMA3.2-1B and LLaMA3-8B, demonstrate that OTARo achieves consistently strong and robust performance across all precisions.
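The exploration-exploitation idea behind BPS can be sketched as an epsilon-greedy choice of the next training bit-width: usually exploit the precision whose recent score is best, occasionally explore another one. This is a minimal sketch only; the paper's actual scoring mechanism and path update are more elaborate, and `choose_bitwidth` with its per-bit-width score dictionary is a hypothetical interface:

```python
import random

def choose_bitwidth(scores, bitwidths=(2, 3, 4, 6, 8), eps=0.2):
    """Epsilon-greedy selection of the next fine-tuning bit-width,
    loosely in the spirit of BPS. `scores` maps a bit-width to its
    current score (e.g., recent loss improvement); unseen widths score 0."""
    if random.random() < eps:
        # Exploration: occasionally try a random precision.
        return random.choice(bitwidths)
    # Exploitation: pick the precision with the best current score.
    return max(bitwidths, key=lambda b: scores.get(b, 0.0))
```

With `eps=0.0` the choice is purely greedy, e.g. `choose_bitwidth({2: 0.1, 4: 0.9, 8: 0.3}, bitwidths=(2, 4, 8), eps=0.0)` returns `4`; raising `eps` trades exploitation for exploration of under-sampled precisions.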