OTARo: Once Tuning for All Precisions toward Robust On-Device LLMs

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional quantization methods lack flexibility in switching bit-widths between fine-tuning and deployment, leading to poor accuracy adaptability for on-device multi-task scenarios (e.g., understanding and generation). This paper proposes the first robust quantization framework enabling multi-precision inference after a single fine-tuning pass. Its core contributions are: (1) Shared Exponent Floating-Point (SEFP) quantization, unifying scale representations across varying bit-widths; (2) Bit-width Path Search (BPS), a strategy balancing exploration and exploitation to identify optimal precision configurations; and (3) Low-Precision Asynchronous Accumulation (LAA), mitigating error propagation across mixed-precision layers. Evaluated on LLaMA3.2-1B and LLaMA3-8B, the framework achieves consistently high performance and strong robustness across all target bit-widths (2–8 bits). It significantly enhances flexibility and efficiency for edge deployment without requiring task-specific retraining or separate quantization pipelines.
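The shared-exponent idea can be pictured with a small sketch: a block of weights stores one power-of-two exponent, and each target bit-width only changes how many mantissa bits survive, so one stored model serves every precision. The function name, block layout, and rounding choice below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sefp_quantize(block, mantissa_bits):
    """Quantize a weight block with one shared power-of-two exponent.

    Hypothetical sketch of shared-exponent floating point: every value is a
    signed mantissa with `mantissa_bits` fractional bits, scaled by a single
    exponent shared across the block. Lower bit-widths reuse the same
    exponent and simply keep fewer mantissa bits.
    """
    # Shared exponent: smallest power of two covering the block's max magnitude.
    max_abs = np.max(np.abs(block))
    exp = int(np.ceil(np.log2(max_abs))) if max_abs > 0 else 0
    scale = 2.0 ** exp
    # Signed mantissa grid with `mantissa_bits` fractional bits.
    step = 2.0 ** -mantissa_bits
    mant = np.clip(np.round(block / scale / step),
                   -2 ** mantissa_bits, 2 ** mantissa_bits - 1)
    return mant * step * scale, exp

w = np.array([0.31, -0.12, 0.07, -0.25])
w_hi, e_hi = sefp_quantize(w, 7)  # "8-bit-like" precision
w_lo, e_lo = sefp_quantize(w, 2)  # "3-bit-like" precision, same exponent
```

Note the key property for precision switching: the exponent is identical at every bit-width, so no per-precision scaling factors need to be stored or recalibrated.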

📝 Abstract
Fine-tuning techniques for Large Language Models (LLMs) not only improve adaptability to diverse downstream tasks, but also mitigate the adverse effects of model quantization. Despite this, conventional quantization suffers from a structural limitation that hinders flexibility during the fine-tuning and deployment stages. Practical on-device tasks demand different quantization precisions (i.e., different bit-widths); for example, understanding tasks tend to exhibit higher tolerance to reduced precision than generation tasks. Conventional quantization, which typically relies on scaling factors that are incompatible across bit-widths, fails to support on-device precision switching when confronted with complex real-world scenarios. To overcome this dilemma, we propose OTARo, a novel method that enables on-device LLMs to flexibly switch quantization precisions while maintaining performance robustness through a single fine-tuning pass. OTARo introduces Shared Exponent Floating Point (SEFP), a distinct quantization mechanism, to produce different bit-widths through simple mantissa truncations of a single model. Moreover, to achieve bit-width robustness in downstream applications, OTARo performs a learning process over the losses induced by different bit-widths. The method involves two critical strategies: (1) Exploitation-Exploration Bit-Width Path Search (BPS), which iteratively updates the search path via a designed scoring mechanism; (2) Low-Precision Asynchronous Accumulation (LAA), which performs asynchronous gradient accumulations and delayed updates under low bit-widths. Experiments on popular LLMs, e.g., LLaMA3.2-1B and LLaMA3-8B, demonstrate that OTARo achieves consistently strong and robust performance across all precisions.
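The exploitation-exploration search over bit-widths can be pictured as a bandit-style selector: each candidate bit-width accumulates a visit count and a mean observed loss, and an optimism bonus keeps under-visited widths in play. The UCB-style score below is an assumed stand-in; the paper's actual scoring mechanism may differ.

```python
import math

def select_bitwidth(stats, bitwidths, step, c=1.0):
    """Pick the next training bit-width by an exploitation-exploration score.

    Hypothetical sketch: `stats` maps each bit-width to
    (visit_count, mean_observed_loss). Lower loss means a higher score
    (exploitation); a UCB-style bonus rewards rarely visited bit-widths
    (exploration).
    """
    def score(b):
        n, mean_loss = stats[b]
        if n == 0:
            return float("inf")  # always try unvisited bit-widths first
        return -mean_loss + c * math.sqrt(math.log(step + 1) / n)
    return max(bitwidths, key=score)

def update_stats(stats, b, loss):
    # Running-mean update for the chosen bit-width.
    n, mean = stats[b]
    stats[b] = (n + 1, mean + (loss - mean) / (n + 1))

bitwidths = [2, 4, 8]
stats = {b: (0, 0.0) for b in bitwidths}
for t in range(200):
    b = select_bitwidth(stats, bitwidths, t)
    update_stats(stats, b, 1.0 / b)  # toy loss: lower bit-width, higher loss
```

Under this toy loss, every bit-width still gets visited, but the search path concentrates on the width with the lowest observed loss.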
Problem

Research questions and friction points this paper is trying to address.

Enables flexible quantization precision switching for on-device LLMs
Maintains performance robustness across different bit-widths through single tuning
Overcomes structural limitations of conventional quantization methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shared Exponent Floating Point enables flexible bit-width switching
Exploitation-Exploration Bit-Width Path Search optimizes precision selection
Low-Precision Asynchronous Accumulation handles gradient updates efficiently
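The accumulation idea behind the last bullet can be sketched as follows: under low bit-widths, per-step gradients are noisy, so they are accumulated over several micro-steps and the weight update is delayed until the buffer is full. This is a minimal reading of LAA; the `quantize` callback, learning rate, and buffer layout are illustrative assumptions, not the paper's API.

```python
import numpy as np

def laa_train(w, grads, quantize, accum_steps, lr=0.1):
    """Low-Precision Asynchronous Accumulation, as a minimal sketch.

    Gradients are accumulated in higher precision; the (quantized) weight
    update is applied only every `accum_steps` micro-steps, so low-precision
    noise is averaged out before it touches the weights.
    """
    buffer = np.zeros_like(w)
    for t, g in enumerate(grads, start=1):
        buffer += g                       # accumulate in higher precision
        if t % accum_steps == 0:          # delayed, averaged update
            w = quantize(w - lr * buffer / accum_steps)
            buffer[:] = 0.0
    return w

# Toy usage: identity "quantizer", constant unit gradients.
w = laa_train(np.array([1.0]), [np.ones(1)] * 4,
              quantize=lambda x: x, accum_steps=2)
```

With four unit gradients and two-step accumulation, only two (averaged) updates are applied instead of four noisy ones.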
Shaoyuan Chen
Houmo AI, Sun Yat-sen University
Zhixuan Chen
Houmo AI
Dawei Yang
Houmo AI
Zhihang Yuan
Bytedance
Efficient AI · Model Compression · MLLM
Qiang Wu
Houmo AI