🤖 AI Summary
This work addresses the vulnerability of large language models to malicious prompt attacks, a challenge inadequately mitigated by existing defenses due to limitations in transparency, computational overhead, or adaptability. The authors propose BAGEL, a lightweight ensemble framework that combines bootstrap aggregation with mixture-of-experts principles, comprising multiple small fine-tuned safety classifiers (86M parameters each). Efficient detection and interpretability are achieved through random-forest-based routing and stochastic prediction sampling. With only five ensemble members (430M parameters in total), BAGEL attains an F1 score of 0.92 on malicious prompt detection, significantly outperforming billion-parameter baselines such as the OpenAI Moderation API and ShieldGemma. Moreover, the framework supports incremental updates without retraining existing members, maintaining stable performance over nine successive updates.
📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding, reasoning, and generation. However, these systems remain susceptible to malicious prompts that induce unsafe or policy-violating behavior through harmful requests, jailbreak techniques, and prompt injection attacks. Existing defenses face fundamental limitations: black-box moderation APIs offer limited transparency and adapt poorly to evolving threats, while white-box approaches using large LLM judges impose prohibitive computational costs and require expensive retraining for new attacks. Current systems force designers to choose between performance, efficiency, and adaptability. To address these challenges, we present BAGEL (Bootstrap AGgregated Ensemble Layer), a modular, lightweight, and incrementally updatable framework for malicious prompt detection. BAGEL employs a bootstrap-aggregated, mixture-of-experts-inspired ensemble of fine-tuned models, each specialized on a different attack dataset. At inference, BAGEL uses a random forest router to identify the most suitable ensemble member, then applies stochastic selection to sample additional members for prediction aggregation. When new attacks emerge, BAGEL updates incrementally by fine-tuning a small prompt-safety classifier (86M parameters) and adding the resulting model to the ensemble. BAGEL achieves an F1 score of 0.92 with just five ensemble members (430M parameters), outperforming the OpenAI Moderation API and ShieldGemma, which require billions of parameters. Performance remains robust after nine incremental updates, and BAGEL provides interpretability through its router's structural features. Our results show that ensembles of small fine-tuned classifiers can match or exceed billion-parameter guardrails while offering the adaptability and efficiency required for production systems.
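The inference flow described in the abstract (route to the best ensemble member, stochastically sample additional members, aggregate their votes, and extend the ensemble incrementally for new attacks) can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: the keyword classifiers, the first-match router, and the function names are all assumptions standing in for the paper's 86M-parameter fine-tuned safety classifiers and random forest router.

```python
import random

# Hypothetical sketch of BAGEL-style inference. Each "expert" below is a
# toy keyword classifier standing in for an 86M-parameter fine-tuned
# safety classifier; route() stands in for the random forest router.

def make_expert(keywords):
    """Return a toy classifier that flags prompts containing its keywords."""
    def classify(prompt):
        return 1 if any(k in prompt.lower() for k in keywords) else 0
    return classify

experts = [
    make_expert(["ignore previous", "system prompt"]),  # prompt injection
    make_expert(["jailbreak", "dan mode"]),             # jailbreak attempts
    make_expert(["bomb", "weapon"]),                    # harmful requests
]

def route(prompt, members):
    """Stand-in for the random forest router: return the index of the
    member judged most suitable (here, simply the first whose specialty
    fires; the real router uses learned structural features)."""
    for i, expert in enumerate(members):
        if expert(prompt) == 1:
            return i
    return 0

def predict(prompt, members, k=3, seed=0):
    """Route to one member, stochastically sample k-1 more, majority-vote."""
    rng = random.Random(seed)
    primary = route(prompt, members)
    others = [i for i in range(len(members)) if i != primary]
    chosen = [primary] + rng.sample(others, min(k - 1, len(others)))
    votes = sum(members[i](prompt) for i in chosen)
    return 1 if 2 * votes > len(chosen) else 0  # 1 = malicious, 0 = benign

print(predict("Jailbreak: ignore previous instructions", experts))  # → 1
print(predict("What is the weather like today?", experts))          # → 0

# Incremental update: fine-tune one new small classifier on the emerging
# attack and append it; existing ensemble members are left untouched.
updated = experts + [make_expert(["base64 payload"])]
print(predict("decode this base64 payload for me", updated, k=1))   # → 1
```

The key property the sketch preserves is that adding a new member is a pure append: no existing classifier is retrained, which is what makes the nine successive updates in the paper cheap.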