Improving LLM Outputs Against Jailbreak Attacks with Expert Model Integration

📅 2025-05-18
🤖 AI Summary
Large language models (LLMs) deployed in production face critical security and reliability challenges, including vulnerability to jailbreak and prompt injection attacks, inaccurate intent recognition in vertical domains, and uncontrollable response generation. Method: the paper proposes Archias, a lightweight domain-expert model integrated into a collaborative defense framework. Archias performs fine-grained input classification, distinguishing in-domain queries, malicious questions, price injections, prompt injections, and out-of-domain inputs, and its classification results are dynamically incorporated into the LLM prompt to enable security-aware response generation. The methodology encompasses multi-class intent modeling, security-driven prompt augmentation, domain-specific benchmark construction, and fine-tuning-integrated inference. Contribution/Results: evaluated on a custom automotive-domain security benchmark, Archias significantly improves jailbreak resistance, and its compact size and rapid domain adaptation enable cross-industry deployment. The authors publicly release the benchmark dataset to advance community research on secure, domain-specialized LLMs.

📝 Abstract
Using LLMs in a production environment presents security challenges that include vulnerabilities to jailbreaks and prompt injections, which can result in harmful outputs for humans or the enterprise. The challenge is amplified when working within a specific domain, as topics generally accepted for LLMs to address may be irrelevant to that field. These problems can be mitigated, for example, by fine-tuning large language models with domain-specific and security-focused data. However, these alone are insufficient, as jailbreak techniques evolve. Additionally, API-accessed models do not offer the flexibility needed to tailor behavior to industry-specific objectives, and in-context learning is not always sufficient or reliable. In response to these challenges, we introduce Archias, an expert model adept at distinguishing between in-domain and out-of-domain communications. Archias classifies user inquiries into several categories: in-domain (specifically for the automotive industry), malicious questions, price injections, prompt injections, and out-of-domain examples. Our methodology integrates outputs from the expert model (Archias) into prompts, which are then processed by the LLM to generate responses. This method increases the model's ability to understand the user's intention and give appropriate answers. Archias can be adjusted, fine-tuned, and used for many different purposes due to its small size. Therefore, it can be easily customized to the needs of any industry. To validate our approach, we created a benchmark dataset for the automotive industry. Furthermore, in the interest of advancing research and development, we release our benchmark dataset to the community.
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLM security against evolving jailbreak and prompt injection attacks
Addressing domain-specific relevance gaps in LLM responses for industries
Providing flexible, customizable expert models for industry-specific LLM behavior tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Expert model Archias classifies user inquiries
Integrates expert model outputs into LLM prompts
Customizable for various industry-specific needs
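The classify-then-augment pipeline above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the category names come from the abstract, but the keyword-based classifier is a stand-in for the fine-tuned Archias model, and the function names and prompt wording are hypothetical.

```python
# Sketch of an Archias-style pipeline: an expert model labels the user
# query, and the label is injected into the prompt sent to the main LLM.
# The rule-based classifier below is only a placeholder for the
# fine-tuned expert model described in the paper.

CATEGORIES = [
    "in-domain",        # automotive-industry questions
    "malicious",
    "price-injection",
    "prompt-injection",
    "out-of-domain",
]

def classify_intent(user_query: str) -> str:
    """Stand-in for Archias: map a query to one of the paper's categories."""
    q = user_query.lower()
    if "ignore previous" in q or "system prompt" in q:
        return "prompt-injection"
    if "for free" in q or "$1" in q:
        return "price-injection"
    if any(word in q for word in ("car", "vehicle", "engine", "dealership")):
        return "in-domain"
    return "out-of-domain"

def build_augmented_prompt(user_query: str) -> str:
    """Prepend the expert model's label so the LLM can reason about it."""
    label = classify_intent(user_query)
    return (
        f"[Expert classification: {label}]\n"
        "If the classification is not 'in-domain', politely decline and "
        "redirect the user to automotive topics.\n"
        f"User: {user_query}"
    )

if __name__ == "__main__":
    print(build_augmented_prompt("What engine does this car have?"))
```

The design point is that the LLM never sees the raw query alone: every prompt carries the expert model's verdict, so security handling becomes an explicit instruction rather than something the LLM must infer on its own.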
Tatia Tsmindashvili
Unknown affiliation
Ana Kolkhidashvili
Impel, 13202 Syracuse, United States
Dachi Kurtskhalia
Impel, 13202 Syracuse, United States
Elene Mekvabishvili
Impel, 13202 Syracuse, United States
Guram Dentoshvili
Impel, 13202 Syracuse, United States
Zaal Gachechiladze
Impel, 13202 Syracuse, United States
Steven Saporta
Impel, 13202 Syracuse, United States