HAMSA: Hijacking Aligned Compact Models via Stealthy Automation

📅 2025-08-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Despite alignment, compact large language models remain vulnerable to jailbreaking attacks; existing adversarial prompts often suffer from semantic fragmentation and are easily detected by perplexity-based safety filters. To address this, we propose a stealthy and efficient automated red-teaming framework. Our method employs a multi-stage evolutionary search strategy integrating population-based optimization, temperature-controlled diverse prompt generation, multilingual adversarial sample construction, and perplexity-aware evasion mechanisms. Crucially, we introduce the first temperature-controllable evolutionary algorithm coupled with explicit semantic coherence constraints, enabling high-success-rate, natural-sounding jailbreak prompts. Experiments on English and a newly constructed Arabic benchmark demonstrate that our approach significantly improves attack success rates while generating fluent, human-like prompts that evade state-of-the-art safety detectors. This work establishes a novel evaluation paradigm for assessing the robustness of compact LLMs against jailbreaking.
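The search procedure described above (population-based optimization, temperature-controlled generation, perplexity-aware evasion) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `mutate`, the scoring function, and the perplexity gate are hypothetical stand-ins for the paper's LLM-driven components.

```python
import random

# Hypothetical stand-ins (not the authors' code):
# - mutate(prompt, temperature): prompt rewrite; higher temperature = more diverse edits
#   (a real system would call an LLM here; this toy version just shuffles a suffix)
# - score(prompt): attack-success signal on the target model
# - perplexity(prompt): fluency estimate from a reference LM (lower = more natural)

def mutate(prompt: str, temperature: float) -> str:
    """Toy mutation: shuffle a temperature-sized suffix of the prompt."""
    words = prompt.split()
    k = max(1, int(temperature * len(words)))
    tail = words[-k:]
    random.shuffle(tail)
    return " ".join(words[:-k] + tail)

def evolve(seeds, score, perplexity, generations=10, pop_size=20,
           t_start=1.0, t_end=0.2, ppl_max=50.0):
    """Temperature-controlled evolutionary search with a perplexity gate."""
    population = list(seeds)
    for g in range(generations):
        # Anneal temperature: explore broadly early, preserve coherence late.
        t = t_start + (t_end - t_start) * g / max(1, generations - 1)
        children = [mutate(random.choice(population), t) for _ in range(pop_size)]
        # Perplexity-aware evasion: discard unnatural-sounding candidates outright.
        children = [c for c in children if perplexity(c) <= ppl_max]
        # Elitist selection over parents + children.
        population = sorted(population + children, key=score, reverse=True)[:pop_size]
    return population[0]
```

The key design point mirrored here is that fluency is enforced as a hard constraint (the perplexity gate) rather than folded into the fitness score, so every surviving candidate stays plausible to a perplexity-based safety filter.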

📝 Abstract
Large Language Models (LLMs), especially their compact efficiency-oriented variants, remain susceptible to jailbreak attacks that can elicit harmful outputs despite extensive alignment efforts. Existing adversarial prompt generation techniques often rely on manual engineering or rudimentary obfuscation, producing low-quality or incoherent text that is easily flagged by perplexity-based filters. We present an automated red-teaming framework that evolves semantically meaningful and stealthy jailbreak prompts for aligned compact LLMs. The approach employs a multi-stage evolutionary search in which candidate prompts are iteratively refined using a population-based strategy augmented with temperature-controlled variability to balance exploration against coherence preservation. This enables the systematic discovery of prompts that bypass alignment safeguards while maintaining natural language fluency. We evaluate our method on an English benchmark (In-The-Wild Jailbreak Prompts on LLMs) and a newly curated Arabic benchmark derived from it and annotated by native Arabic linguists, enabling multilingual assessment.
Problem

Research questions and friction points this paper is trying to address.

Automated generation of stealthy jailbreak prompts for compact LLMs
Bypassing alignment safeguards while maintaining natural language fluency
Multilingual evaluation including English and Arabic benchmark datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated evolutionary search for jailbreak prompts
Temperature-controlled variability for coherence preservation
Multilingual evaluation with English and Arabic benchmarks
Authors

Alexey Krylov (MIPT, Sberbank)
Iskander Vagizov (MIPT, Sberbank)
Dmitrii Korzh (MTUCI)
Maryam Douiba (Cadi Ayyad University)
Azidine Guezzaz (Cadi Ayyad University)
Vladimir Kokh (Sberbank)
Sergey D. Erokhin (MTUCI)
Elena V. Tutubalina (MIPT, Sberbank, ISP RAS)
Oleg Y. Rogov (University of Sharjah, MTUCI)
Tags: ai, multimodal learning, reasoning, safe ai