🤖 AI Summary
To address the dual challenges of quantization-induced accuracy degradation and high inference-latency variance when deploying deep neural networks on resource-constrained devices, this paper proposes a curiosity-aware quantized Mixture-of-Experts (MoE) architecture driven by Bayesian epistemic uncertainty. The method combines heterogeneous experts (BitNet-based ternarization, 1–16-bit BitLinear, and post-training quantization) with an information-theoretic adaptive routing mechanism that dynamically allocates computational load. Evaluated on multiple audio classification benchmarks, the 4-bit quantized model achieves an F1 score of 0.858, retaining 99.9% of full-precision performance, while delivering a 4× reduction in model size, 41% energy savings versus 8-bit, and an 87% reduction in inference-latency standard deviation (from 230 ms to 29 ms). These results demonstrate substantial improvements in accuracy, efficiency, and inference determinism for edge deployment.
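To make the ternary expert concrete, below is a minimal sketch of BitNet-style absmean ternarization, where each weight is scaled by the mean absolute weight and rounded into {-1, 0, +1}. This is an illustrative reimplementation of the general technique, not the paper's code; the function name and epsilon are my own.

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    """BitNet-style ternarization: scale by the absmean of the
    weights, round, and clip to the ternary codebook {-1, 0, +1}.
    Returns the codes and the scale needed to dequantize (w_q * gamma)."""
    gamma = np.mean(np.abs(w)) + eps           # absmean scale
    w_q = np.clip(np.round(w / gamma), -1, 1)  # ternary codes
    return w_q, gamma

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
w_q, gamma = ternary_quantize(w)
# every quantized entry lies in the ternary codebook
assert set(np.unique(w_q)).issubset({-1.0, 0.0, 1.0})
```

Dequantizing as `w_q * gamma` recovers a low-precision approximation of `w`; storing only the ternary codes plus one scale per tensor is what gives the extreme compression of the ternary expert.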
📝 Abstract
Deploying deep neural networks on resource-constrained devices faces two critical challenges: maintaining accuracy under aggressive quantization while ensuring predictable inference latency. We present a curiosity-driven quantized Mixture-of-Experts framework that addresses both through Bayesian epistemic uncertainty-based routing across heterogeneous experts (BitNet ternary, 1–16-bit BitLinear, post-training quantization). Evaluated on audio classification benchmarks (ESC-50, Quinn, UrbanSound8K), our 4-bit quantization maintains 99.9 percent of 16-bit accuracy (0.858 vs 0.859 F1) with 4x compression and 41 percent energy savings versus 8-bit. Crucially, curiosity-driven routing reduces MoE latency variability by 87 percent (p = 0.008, Levene's test), cutting the standard deviation from 230 ms to 29 ms and enabling stable inference for battery-constrained devices. Statistical analysis confirms that 4-bit and 8-bit quantization achieve practical equivalence with full precision (p > 0.05), while MoE architectures introduce 11 percent latency overhead (p < 0.001) without accuracy gains. At scale, deployment emissions dominate training emissions by 10,000x for models serving more than 1,000 inferences, making inference efficiency critical. Our information-theoretic routing demonstrates that adaptive quantization yields accurate (0.858 F1, 1.2M params), energy-efficient (3.87 F1/mJ), and predictable edge models, with simple 4-bit quantized architectures outperforming complex MoE for most deployments.
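The routing idea described above can be sketched as follows: estimate epistemic uncertainty from several stochastic forward passes (an MC-dropout-style approximation), then send low-uncertainty inputs to the cheapest expert and high-uncertainty inputs to higher-precision experts. The entropy thresholds and the three-expert mapping here are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def predictive_entropy(probs):
    """Entropy (in nats) of the mean predictive distribution over
    T stochastic forward passes (probs: T x C array). Higher entropy
    signals higher epistemic uncertainty."""
    mean_p = probs.mean(axis=0)
    return -np.sum(mean_p * np.log(mean_p + 1e-12))

def route(probs, thresholds=(0.3, 0.9)):
    """Map uncertainty to an expert index: 0 = ternary (cheapest),
    1 = 4-bit, 2 = full precision. Thresholds are hypothetical."""
    h = predictive_entropy(probs)
    low, high = thresholds
    return 0 if h < low else (1 if h < high else 2)

rng = np.random.default_rng(1)
# 8 near-identical confident passes -> low entropy -> cheap expert
confident = np.tile([0.97, 0.02, 0.01], (8, 1))
# 8 scattered passes -> higher entropy -> a more precise expert
uncertain = rng.dirichlet(np.ones(3), size=8)
assert route(confident) == 0
assert route(uncertain) in (1, 2)
```

Routing most inputs to the cheapest expert while reserving high-precision compute for genuinely ambiguous inputs is what bounds the latency spread, since the expensive path is taken predictably rather than at random.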