Uncertainty Makes It Stable: Curiosity-Driven Quantized Mixture-of-Experts

📅 2025-11-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the dual challenges of quantization-induced accuracy degradation and high inference latency variance in deploying deep neural networks on resource-constrained devices, this paper proposes a Bayesian epistemic uncertainty-driven “curiosity-aware” quantized Mixture-of-Experts (MoE) architecture. The method integrates heterogeneous experts—namely, BitNet-based ternarization, 1–16-bit BitLinear, and post-training quantization—with an information-theoretic adaptive routing mechanism to dynamically allocate computational load. Evaluated on multiple audio classification benchmarks, the 4-bit quantized model achieves an F1 score of 0.858—retaining 99.9% of full-precision performance—while delivering a 4× model size reduction, 41% energy savings, and a dramatic 87% reduction in inference latency standard deviation (from 230 ms to 29 ms). These results demonstrate substantial improvements in accuracy, efficiency, and inference determinism for edge deployment.
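The routing idea described above can be illustrated with a toy sketch (not the authors' implementation): epistemic uncertainty is estimated from the disagreement among stochastic forward passes (e.g. MC dropout) as the mutual information between the prediction and the model parameters, and inputs with high uncertainty are escalated to a higher-precision expert. The expert names and threshold below are hypothetical.

```python
import numpy as np

def epistemic_uncertainty(probs):
    """Mutual information between prediction and model parameters.

    probs: (n_passes, n_classes) softmax outputs from stochastic
    forward passes. Total predictive entropy minus the mean per-pass
    entropy isolates the epistemic (model-disagreement) component.
    """
    mean = probs.mean(axis=0)
    total = -np.sum(mean * np.log(mean + 1e-12))
    aleatoric = -np.mean(np.sum(probs * np.log(probs + 1e-12), axis=1))
    return total - aleatoric

def route(probs, threshold=0.1):
    """Send confident inputs to the cheap 4-bit expert; escalate
    uncertain ones to 16-bit. Threshold is illustrative only."""
    if epistemic_uncertainty(probs) > threshold:
        return "expert_16bit"
    return "expert_4bit"

# All passes agree -> zero epistemic uncertainty -> cheap expert.
confident = np.tile([0.9, 0.05, 0.05], (8, 1))
# Passes disagree on the class -> high epistemic uncertainty.
uncertain = np.array([[0.9, 0.05, 0.05]] * 4 + [[0.05, 0.9, 0.05]] * 4)

print(route(confident))  # -> expert_4bit
print(route(uncertain))  # -> expert_16bit
```

Because epistemic uncertainty vanishes when all stochastic passes agree, this gate sends only genuinely ambiguous inputs through the expensive path, which is consistent with the paper's claim that adaptive allocation stabilizes latency.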

📝 Abstract
Deploying deep neural networks on resource-constrained devices faces two critical challenges: maintaining accuracy under aggressive quantization while ensuring predictable inference latency. We present a curiosity-driven quantized Mixture-of-Experts framework that addresses both through Bayesian epistemic uncertainty-based routing across heterogeneous experts (BitNet ternary, 1-16 bit BitLinear, post-training quantization). Evaluated on audio classification benchmarks (ESC-50, Quinn, UrbanSound8K), our 4-bit quantization maintains 99.9 percent of 16-bit accuracy (0.858 vs 0.859 F1) with 4x compression and 41 percent energy savings versus 8-bit. Crucially, curiosity-driven routing reduces MoE latency variance by 82 percent (p = 0.008, Levene's test) from 230 ms to 29 ms standard deviation, enabling stable inference for battery-constrained devices. Statistical analysis confirms 4-bit/8-bit achieve practical equivalence with full precision (p > 0.05), while MoE architectures introduce 11 percent latency overhead (p < 0.001) without accuracy gains. At scale, deployment emissions dominate training by 10000x for models serving more than 1,000 inferences, making inference efficiency critical. Our information-theoretic routing demonstrates that adaptive quantization yields accurate (0.858 F1, 1.2M params), energy-efficient (3.87 F1/mJ), and predictable edge models, with simple 4-bit quantized architectures outperforming complex MoE for most deployments.
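One of the expert families mentioned in the abstract is BitNet ternarization. As a minimal sketch of the general technique (absmean ternarization in the style of BitNet b1.58, not necessarily the paper's exact kernel), weights are scaled by their mean absolute value and rounded to {-1, 0, +1}:

```python
import numpy as np

def ternarize(w, eps=1e-8):
    """Absmean ternarization: scale by mean |w|, round, clip to
    {-1, 0, +1}. The original tensor is approximated by q * scale."""
    scale = np.abs(w).mean() + eps
    q = np.clip(np.round(w / scale), -1, 1)
    return q, scale

w = np.array([0.8, -0.3, 0.02, -1.1])
q, scale = ternarize(w)
# q contains only values from {-1, 0, +1}; scale = mean(|w|) = 0.555
```

Storing only the ternary codes plus one scale per tensor is what enables the large compression factors cited above, at the cost of the quantization error absorbed during training or calibration.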
Problem

Research questions and friction points this paper is trying to address.

Deploying neural networks on resource-limited devices with accuracy preservation
Ensuring predictable inference latency under aggressive quantization constraints
Achieving energy-efficient edge computing while maintaining model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian uncertainty routing across heterogeneous experts
Four-bit quantization maintains near-full precision accuracy
Curiosity-driven routing reduces latency variance by 82 percent
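The finding that simple 4-bit quantization nearly matches full precision can be illustrated with a generic symmetric per-tensor int4 scheme (a sketch of standard post-training quantization, not the paper's calibration procedure):

```python
import numpy as np

def quantize_int4(x, eps=1e-8):
    """Symmetric per-tensor 4-bit quantization: integer levels in
    [-8, 7], scale chosen so the largest magnitude maps to +/-7."""
    scale = max(np.abs(x).max() / 7.0, eps)
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original float tensor."""
    return q.astype(np.float32) * scale

np.random.seed(0)
x = np.random.randn(16).astype(np.float32)
q, scale = quantize_int4(x)
err = np.abs(dequantize(q, scale) - x).max()
# Worst-case rounding error is bounded by scale / 2.
```

With 16 levels the rounding error stays within half a quantization step, which helps explain why moderate bit widths like 4-bit often preserve accuracy that binary or ternary schemes must recover through retraining.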
Sebastián Andrés Cajas Ordóñez
Harvard University
mhealth · deep learning · computer vision · aerospace
Luis Fernando Torres Torres
Université de Rennes, France
Mackenzie J. Meni
Technetium Engineering, Florida, USA
Carlos Andrés Duran Paredes
Institución Universitaria Colegio Mayor del Cauca, Colombia
Eric Arazo
CeADAR - Ireland’s Centre for AI, University College Dublin, Ireland
Cristian Bosch
CeADAR - Ireland’s Centre for AI, University College Dublin, Ireland
Ricardo Simon Carbajo
CeADAR - Ireland’s Centre for AI, University College Dublin, Ireland
Yuan Lai
Tsinghua University, Asst. Professor in Urban Science and Planning
Urban Science · Urban Informatics · Digital Health · Smart Cities
Leo Anthony Celi
Massachusetts Institute of Technology