🤖 AI Summary
To address the challenge of deploying large language models (LLMs) on resource-constrained devices, this paper proposes the first efficient Transformer architecture to integrate weight binarization with multi-stage early exit. Methodologically, it introduces a differentiable second-order approximation to the impulse function that enables magnitude-aware gradient updates during binarization, and it replaces hard-threshold early exit with a soft routing mechanism based on fractional entropy reduction to mitigate "overthinking." The architecture supports end-to-end training without knowledge distillation. Experimental results show an 18.44× reduction in model size, a 54.85% reduction in inference FLOPs, and a 5.98% improvement in accuracy on GLUE, achieving a Pareto-optimal trade-off between efficiency and accuracy.
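The second-order surrogate is what keeps binarization trainable end to end: the forward pass produces binary weights, while the backward pass substitutes a smooth approximation of the impulse function so gradients still reflect weight magnitude. The sketch below illustrates the general idea in PyTorch; the piecewise-quadratic surrogate (borrowed here from the Bi-Real-Net family) and the mean-|w| scaling are illustrative assumptions, not the paper's exact formulation.

```python
import torch

class SpikeApproxSign(torch.autograd.Function):
    """Binarize weights in the forward pass; use a second-order (piecewise-
    quadratic) surrogate of the impulse function as the backward gradient.

    Illustrative sketch only: the exact approximation and scaling used by
    BEExformer may differ.
    """

    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        # Magnitude-aware binarization: sign(w) scaled by the mean |w|
        # (an assumption; the paper may use a different scaling).
        alpha = w.abs().mean()
        return torch.sign(w) * alpha

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        # Second-order approximation of the impulse function:
        # d/dw ≈ 2 - 2|w| for |w| < 1, and 0 elsewhere.
        surrogate = torch.clamp(2.0 - 2.0 * w.abs(), min=0.0)
        return grad_out * surrogate


# Usage: binarize a weight tensor while keeping it differentiable.
w = torch.randn(256, 256, requires_grad=True)
w_bin = SpikeApproxSign.apply(w)  # values in {-alpha, +alpha}
```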
📝 Abstract
Large Language Models (LLMs) based on transformers achieve cutting-edge results on a variety of applications. However, their enormous size and processing requirements make deployment on resource-constrained devices extremely difficult. Among various efficiency considerations, model binarization and Early Exit (EE) are two common and effective solutions. However, binarization may cause performance loss because reduced precision affects gradient estimation and parameter updates. Moreover, existing early-exit mechanisms are still at a nascent stage of research. To ameliorate these issues, we propose the Binarized Early Exit Transformer (BEExformer), the first-ever selective-learning transformer architecture to combine early exit with binarization for textual inference. It improves the binarization process through a differentiable second-order approximation to the impulse function. This enables gradient computation with respect to both the sign and the magnitude of the weights. In contrast to absolute-threshold-based EE, the proposed EE mechanism hinges on the fractional reduction in entropy among intermediate transformer blocks together with soft-routing loss estimation. While binarization yields an 18.44-fold reduction in model size, early exit reduces FLOPs during inference by 54.85% and even improves accuracy by 5.98% by resolving the "overthinking" problem inherent in deep networks. Moreover, the proposed BEExformer simplifies training by not requiring knowledge distillation from a full-precision LLM. Extensive evaluation on the GLUE dataset and comparison with SOTA works showcase its Pareto-optimal performance-efficiency trade-off.
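To make the entropy-based exit criterion concrete, the following is a minimal inference-time sketch: each transformer block feeds an intermediate prediction head, and execution stops once the fractional drop in prediction entropy between consecutive blocks becomes small. The module names (`blocks`, `classifiers`) and the fixed `min_frac_drop` cutoff are assumptions for illustration; BEExformer itself learns routing via a soft-routing loss during training rather than a hand-set threshold.

```python
import torch
import torch.nn.functional as F

def entropy(logits):
    """Shannon entropy of the softmax distribution over classes."""
    p = F.softmax(logits, dim=-1)
    return -(p * p.clamp_min(1e-12).log()).sum(dim=-1)

@torch.no_grad()
def early_exit_inference(blocks, classifiers, x, min_frac_drop=0.1):
    """Run blocks sequentially; exit once the fractional reduction in
    prediction entropy between consecutive blocks falls below min_frac_drop.

    Hypothetical interface: `blocks` and `classifiers` are per-block
    modules, and the hard cutoff is an inference-time simplification.
    """
    prev_H, logits = None, None
    for block, clf in zip(blocks, classifiers):
        x = block(x)
        logits = clf(x)                        # intermediate prediction head
        H = entropy(logits).mean()
        if prev_H is not None and prev_H > 0:
            frac_drop = (prev_H - H) / prev_H  # fractional entropy reduction
            if frac_drop < min_frac_drop:      # little confidence gained
                break                          # exit early, avoid "overthinking"
        prev_H = H
    return logits
```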